Determining action selection policies of an execution device

ABSTRACT

Methods, systems, and apparatus, including computer programs encoded on computer storage media, are described for generating an action selection policy of an execution device for completing a task in an environment. The method includes, in a current iteration, computing a counterfactual value (CFV) of the execution device in a terminal state based on a payoff of the execution device and a reach probability of other devices reaching the terminal state; computing a baseline-corrected CFV of the execution device in the terminal state; for each non-terminal state having child states, computing a CFV of the execution device in the non-terminal state based on a weighted sum of the baseline-corrected CFVs of the execution device in the child states; computing a baseline-corrected CFV and a CFV baseline of the execution device in the non-terminal state; and determining an action selection policy in the non-terminal state for the next iteration.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of PCT Application No. PCT/CN2019/124933, filed on Dec. 12, 2019, which is hereby incorporated by reference in its entirety.

TECHNICAL FIELD

This specification relates to determining action selection policies for an execution device for completing a task in an environment that includes the execution device and one or more other devices.

BACKGROUND

Strategic interaction between two or more parties can be modeled and simulated by a game that involves two or more parties (also referred to as players). In Imperfect Information Games (IIG) that involve two or more players, a player only has partial access to the knowledge of her opponents before making a decision. This is similar to real-world scenarios, such as trading, traffic routing, and public auction. Many real-life scenarios can be represented as IIGs, such as commercial competition between different companies, bidding relationships in auction scenarios, and game relationships between a fraud party and an anti-fraud party.

Due to the hidden information, a player has to make decisions with uncertainty about her opponents' information, and she also needs to act so as to take advantage of her opponents' uncertainty about her own information. Solving an IIG can be computationally expensive and time-consuming, especially for large games that have a large number of possible states and possible actions to choose from. Techniques for solving an IIG in an efficient manner are desirable.

SUMMARY

Described embodiments of the subject matter can include one or more features, alone or in combination.

For example, in one embodiment, a computer-implemented method of an execution device for generating an action selection policy for completing a task in an environment that includes the execution device and one or more other devices, the method comprising: in a current iteration of a plurality of iterations, computing a counterfactual value (CFV) of the execution device in a terminal state of completing a task based on a payoff of the execution device at the terminal state and a reach probability of the one or more other devices reaching the terminal state, wherein the terminal state results from a sequence of actions taken at a plurality of non-terminal states by the execution device and by the one or more other devices, wherein each of the plurality of non-terminal states has one or more child states; computing a baseline-corrected CFV of the execution device in the terminal state based on the CFV of the execution device in the terminal state, a CFV baseline of the execution device in the terminal state of a previous iteration, or both; for each of the non-terminal states and starting from a non-terminal state that has the terminal state and one or more other terminal states as child states: computing a CFV of the execution device in the non-terminal state based on a weighted sum of the baseline-corrected CFVs of the execution device in the child states of the non-terminal state; computing a baseline-corrected CFV of the execution device in the non-terminal state based on the CFV of the execution device in the non-terminal state, a CFV baseline of the execution device in the non-terminal state of a previous iteration, or both; computing a CFV baseline of the execution device in the non-terminal state of the current iteration based on a weighted sum of the CFV baseline of the execution device in the non-terminal state of the previous iteration and the CFV or the baseline-corrected CFV of the execution device in the non-terminal state; and determining an action selection policy in the non-terminal state for the next iteration based on the baseline-corrected CFV of the execution device in the non-terminal state of the current iteration.

In some embodiments, these general and specific aspects may be implemented using a system, a method, or a computer program, or any combination of systems, methods, and computer programs. The foregoing and other described embodiments can each, optionally, include one or more of the following aspects:

In some embodiments, in response to determining that a convergence condition is met, operations of the execution device in the non-terminal state are controlled based on the action selection policy in the non-terminal state for the next iteration.

In some embodiments, determining an action selection policy in the non-terminal state for the next iteration based on the baseline-corrected CFV of the execution device in the non-terminal state of the current iteration comprises: calculating a regret value based on the baseline-corrected CFV of the execution device in the non-terminal state of the current iteration; and determining an action selection policy in the non-terminal state for the next iteration based on the regret value according to regret matching.

In some embodiments, the reach probability of the one or more other devices reaching the terminal state comprises a product of probabilities of actions taken by the one or more other devices that reach the terminal state.

In some embodiments, computing a baseline-corrected CFV of the execution device in the non-terminal state based on the CFV of the execution device in the non-terminal state, a CFV baseline of the execution device in the non-terminal state of a previous iteration, or both comprises: computing a sampled CFV baseline of the execution device that takes the action in the terminal state of the previous iteration based on the CFV baseline of the execution device in the terminal state of the previous iteration, a sampling policy of the execution device that takes the action in the terminal state of the previous iteration, and a probability of reaching the terminal state resulting from a sequence of actions taken by the execution device; in response to determining that the action is sampled, computing a baseline-corrected CFV of the execution device that takes the action in the non-terminal state based on the CFV of the execution device in the non-terminal state and the sampled CFV baseline of the execution device that takes the action in the terminal state of the previous iteration; and in response to determining that the action is not sampled, using the sampled CFV baseline of the execution device that takes the action in the terminal state of the previous iteration as the baseline-corrected CFV of the execution device in the non-terminal state.

In some embodiments, the weighted sum of the baseline-corrected CFV of the execution device in the terminal state and the corresponding baseline-corrected CFVs of the execution device in the one or more other terminal states is computed based on the baseline-corrected CFV of the execution device in the terminal state and the corresponding baseline-corrected CFVs of the execution device in the one or more other terminal states, weighted by an action selection policy in the non-terminal state in the current iteration.

In some embodiments, the weighted sum of the CFV baseline of the execution device in the non-terminal state of the previous iteration and the CFV or the baseline-corrected CFV of the execution device in the non-terminal state comprises a sum of: the CFV baseline of the execution device in the non-terminal state of the previous iteration weighted by a scalar; and the CFV or the baseline-corrected CFV of the execution device in the non-terminal state weighted by a second scalar and a probability of considering the non-terminal state.

It is appreciated that methods in accordance with this specification may include any combination of the aspects and features described herein. That is, methods in accordance with this specification are not limited to the combinations of aspects and features specifically described herein, but also include any combination of the aspects and features provided.

The details of one or more embodiments of this specification are set forth in the accompanying drawings and the description below. Other features and advantages of this specification will be apparent from the description and drawings, and from the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A and 1B are diagrams illustrating examples of a game tree and a public tree of Kuhn Poker in accordance with embodiments of this specification.

FIG. 2 is a log-log plot illustrating convergence performances of several MCCFR variants applied to NLPH with different sampling policies in accordance with embodiments of this specification.

FIG. 3 is a log-log plot illustrating convergence performances of several MCCFR variants applied to NLPH with and without exploration techniques in accordance with embodiments of this specification.

FIG. 4A is a log-log plot illustrating convergence performances of several MCCFR variants applied to NLPH with and without different variance reduction techniques in accordance with embodiments of this specification.

FIG. 4B is a log-log plot illustrating example computational efficiencies of several MCCFR variants applied to NLPH with and without different variance reduction techniques in accordance with embodiments of this specification.

FIGS. 5A-5C are log-log plots illustrating convergence performances of several MCCFR variants by external sampling on three different poker games: NLPH, HUNL-R, and NLFH, in accordance with embodiments of this specification.

FIG. 6A is a log-log plot illustrating convergence performances of several MCCFR variants with and without skipping on NLPH in accordance with embodiments of this specification.

FIG. 6B is a log-log plot illustrating convergence performances of MCCFR variants with and without skipping on NLPH in accordance with embodiments of this specification.

FIG. 7 is a flowchart of an example of a process for performing Monte Carlo counterfactual regret minimization (MCCFR) for determining action selection policies for software applications in accordance with embodiments of this specification.

FIG. 8 is a flowchart of an example of another process for performing Monte Carlo counterfactual regret minimization (MCCFR) for determining action selection policies for software applications in accordance with embodiments of this specification.

FIG. 9 is a flowchart of an example of another process for performing Monte Carlo counterfactual regret minimization (MCCFR) for determining action selection policies for software applications in accordance with embodiments of this specification.

FIG. 10 depicts a block diagram illustrating an example of a computer-implemented system used to provide computational functionalities associated with described algorithms, methods, functions, processes, flows, and procedures in accordance with embodiments of this specification.

FIG. 11 depicts examples of modules of an apparatus in accordance with embodiments of this specification.

FIG. 12 depicts examples of modules of another apparatus in accordance with embodiments of this specification.

FIG. 13 depicts examples of modules of another apparatus in accordance with embodiments of this specification.

Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

This specification describes techniques for determining an action selection policy for an execution device for completing a task in an environment that includes the execution device and one or more other devices, for example, for strategic interaction between the execution device and the one or more other devices. For example, the execution device can perform a computer-implemented method for searching for a Nash equilibrium of a game between the execution device and one or more other devices, and obtain an action selection policy (e.g., a solution or strategy) that leads to a Nash equilibrium or an approximate Nash equilibrium. In some embodiments, these techniques can involve performing a counterfactual regret minimization (CFR) algorithm for solving an imperfect information game (IIG). In some embodiments, the techniques can reduce the computational complexity and variance while improving the convergence speed of the CFR algorithm.

An IIG can represent one or more real-world scenarios such as autonomous vehicle (AV) control, resource allocation, product/service recommendation, cyber-attack prediction and/or prevention, traffic routing, fraud management, trading, bidding, etc. that involve two or more parties (also referred to as players), where each party may have incomplete or imperfect information about another party's decisions. This specification uses Poker as an example of an IIG. The described techniques can be used in many other artificial intelligence (AI) and machine learning applications.

The typical target of solving an IIG is to find a Nash equilibrium so that no player can unilaterally improve her reward. In other words, a Nash equilibrium is a typical solution for an IIG that involves two or more players. Counterfactual Regret Minimization (CFR) is an algorithm designed to approximately find Nash equilibria for large games. CFR tries to minimize overall counterfactual regret. It is proven that the average of the strategies over all iterations converges to a Nash equilibrium. When solving a game, CFR in its original form (also referred to as original CFR, standard CFR, vanilla CFR, or simply CFR) traverses the entire game tree in each iteration. Thus, the original CFR requires a large amount of memory for large, zero-sum extensive games such as heads-up no-limit Texas Hold'em. In some instances, the original CFR may not be able to handle large games with limited memory.

Monte Carlo CFR (MCCFR) was introduced to minimize counterfactual regret. MCCFR can solve imperfect information games from sampled experiences. Different from the original CFR, MCCFR samples a subset of nodes in a game tree in each iteration. MCCFR can compute an unbiased estimation of the counterfactual value and avoid traversing the entire game tree. Since only subsets of all information sets are visited in each iteration, MCCFR requires less memory than the original CFR. MCCFR can include different versions or variants, for example, depending on different sampling policies. MCCFR typically has poor long-term performance and high variance due to the sampling.

This specification describes example techniques to accelerate the convergence of MCCFR. For example, the techniques include a vector-form sampling policy, a variance reduction method with a provably unbiased estimate, an exploration technique, and a hybrid MCCFR variant with a skipping mechanism and discounting updates. These techniques can be combined together and applied to MCCFR. Experiment results showed that the described techniques can bring about a 100× to 1000× speedup in many settings of MCCFR.

The techniques described in this specification can generate one or more technical advantages. In some embodiments, the described techniques can be performed by an execution device for generating an action selection policy for completing a task in an environment that includes the execution device and one or more other devices. In some embodiments, the described techniques can determine an action selection policy for a software-implemented application that performs actions in an environment that includes an execution party supported by the application and one or more other parties. In some embodiments, the described techniques can be used in automatic control, autonomous vehicle control, robotics, or any other application that involves action selections. For example, the determined action selection policy can be used to control engines, motors, actuators, valves, and any other equipment, or be applied in a control circuit for controlling operations of one or more devices. In one example, a control system of an autonomous vehicle can be adapted to control the speed, acceleration, direction, and/or travel time of the autonomous vehicle, given predictions of movements of other vehicles in the environment. The control system can help the autonomous vehicle to reach a desired destination with better route selection, reduced travel time, and/or lower fuel consumption. This may facilitate, for example, traffic planning, accident avoidance, and increased operational safety.

As an example of an application in autonomous vehicles, the environment can include multiple autonomous vehicles for completing a task such as traffic planning or control to avoid collision and reach the respective destinations of the multiple autonomous vehicles. Each of the multiple autonomous vehicles can be equipped with an execution device that can implement software-implemented applications for generating an action selection policy for completing the task in the environment. The generated action selection policy includes control information configured to control one or more of an engine, motor, actuator, brake, etc. of the autonomous vehicle. It can, thus, be used by each of the multiple autonomous vehicles to control one or more of an engine, motor, actuator, brake, etc. of the autonomous vehicle so that the autonomous vehicle can follow the generated action selection policy to achieve the task. In some embodiments, the task can be modelled by an IIG and the action selection policy to achieve the task can be generated by computer simulation, for example, by solving the IIG. Each of the multiple autonomous vehicles can represent a party of the IIG. The actions can include, for example, one or more of a specified direction, speed, distance, timing, or any other metric of the autonomous vehicle. The action selection policy of the autonomous vehicle can include a strategy of selecting respective actions at different states (e.g., different intersections in a geographic location) so that the autonomous vehicle can navigate through the environment and reach the destination.

As another example of an application in robotics, the environment can include an industrial robot (e.g., a warehouse robot) that interacts with one or more other parties (e.g., other robots) in order to complete a task (e.g., to move items in the warehouse or to assemble some product). In some embodiments, the task can be modelled by an IIG and the action selection policy to achieve the task can be generated by computer simulation, for example, by solving the IIG. The action can include, for example, one or more of a specified direction, location, speed, or any other motion of the industrial robot. The action selection policy of the industrial robot can include a strategy of selecting respective actions at different states (e.g., different locations in a warehouse) so that the industrial robot can navigate through the environment and complete the task (e.g., moving the items in the warehouse).

In some embodiments, the described techniques can help find better strategies for real-world scenarios such as resource allocation, product/service recommendation, cyber-attack prediction and/or prevention, traffic routing, fraud management, etc. that can be modeled or represented by strategic interaction between parties, such as an IIG that involves two or more parties. In some embodiments, the described techniques can leverage advanced sampling schemes (e.g., with consideration of vectors of current strategies and/or with exploration), which return strategies having smaller variances, closer to a global rather than a local optimal solution, or closer to Nash equilibrium.

In some embodiments, the described techniques can help find strategies for real-world scenarios in a more efficient manner. Accordingly, solutions or strategies for real-world scenarios can be found with less computer simulation and/or within reduced latency/response time. For example, compared to the original CFR, the described techniques are based on MCCFR, which samples only some of all possible combinations of actions of the players of the IIG, significantly reducing the computational load of traversing or exhausting all possible combinations of actions for simulating and solving the IIG. In some embodiments, the solutions or strategies can be found within a significantly shorter response time, helping make possible certain real-world scenarios that require real-time or near real-time response or control.

In some embodiments, the described techniques can improve the convergence speed, improve computational efficiency, and reduce the computational load of the MCCFR algorithm in finding Nash equilibrium for solving a game that represents one or more real-world scenarios. In some embodiments, the described techniques can reduce variances of the MCCFR algorithm.

In some embodiments, the described vector-form sampling policies can provide more efficient sampling when the MCCFR is implemented in a vector form. The described vector-form sampling policies can take into account multiple different strategies at a decision point and compute a sampling policy that pays more attention to the relatively important actions, while achieving better long-term performance in finding the Nash equilibrium (including an approximated Nash equilibrium), for example, by improving the convergence speed of performing MCCFR.

In some embodiments, the described variance reduction technique can reduce the variance and reduce the number of iterations of MCCFR. In some embodiments, the described variance reduction technique can reduce the computational load and improve the computational efficiency by using a control variate algorithm based on a counterfactual value baseline, rather than based on an expected utility value baseline.

In some embodiments, the described hybrid MCCFR algorithm with a skipping mechanism and discounting updates can accelerate the convergence and reduce the variance of MCCFR compared to state-of-the-art methods.

In some embodiments, an extensive-form IIG can be represented as follows. There are n players (in addition to chance) in the IIG. N={1, . . . , n} is a finite set of the players, and each member refers to a player. In a two-player game, N={1, 2}. These two players are denoted by p1 and p2. The hidden information (variable) of player i, which is unobserved by the opponents, is denoted by $h_i^{v}$. Each member h ∈ H refers to a possible history (or state). A history (or state) can include a sequence of actions (including actions of chance) that leads to the state.

For player i, $h_{-i}^{v}$ refers to all players' hidden information except for player i's. The empty sequence ∅ is a member of H. $h_j \sqsubseteq h$ denotes that $h_j$ is a prefix of h. Z denotes the set of terminal histories, and any member z ∈ Z is not a prefix of any other sequence. A terminal history can also be referred to as a terminal state, which can be an end state of the IIG. No further actions need to be taken by any player in a terminal history. Each terminal history z ∈ Z has an associated utility or payoff for each player i.

A player function P assigns a member of N ∪ {c} to each non-terminal history, where c refers to the chance player. P(h) is the player who takes an action at h. A(h)={a : ha ∈ H} is the set of available actions after h ∈ H\Z. A non-terminal history can also be referred to as a non-terminal state, which can be an intermediate state of the IIG. One or more players can have possible actions at a non-terminal state that lead to another state.

The information partition of player i over the histories {h ∈ H : P(h)=i} is denoted $I_i$. A set $I_i \in I_i$ is an information set (infoset) of player i, and $I_i(h)$ refers to the infoset $I_i$ at state h. For $I_i \in I_i$, we have $A(I_i)=A(h)$ and $P(I_i)=P(h)$ for any $h \in I_i$. If all players in a game can recall their previous actions and infosets, the game is referred to as a perfect-recall game.

Given all players' histories, a prefix tree (trie) can be built recursively. Such a prefix tree is called a game tree in game theory. Each node in the game tree refers to a history h. The infoset tree for each player is built on infosets rather than histories. A public tree is a prefix tree built on public sequences. Each public sequence can include actions that are publicly known or observable by all players, or even by a third-party observer. In some embodiments, a terminal history or terminal state can be represented by a terminal node or leaf node of the game tree or public tree. A non-terminal history or non-terminal state can be represented by a non-terminal node of the game tree or public tree. A terminal history z corresponds to a sequence of actions (also referred to as a terminal sequence of actions) that includes the actions taken by all players that result in the terminal history z, that is, the sequence of actions along the trajectory or path from the root node to the terminal node z of the game tree or public tree.
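
For illustration only, the prefix-tree construction described above can be sketched in Python as follows; the class and function names are hypothetical and the two sample histories are taken from the subtree 103 of FIG. 1A.

```python
# Illustrative sketch only: build a prefix tree (trie) over action
# histories, where each terminal history z in Z becomes a leaf node.
# The names `TrieNode` and `build_game_tree` are hypothetical.

class TrieNode:
    def __init__(self):
        self.children = {}       # maps an action to the child TrieNode
        self.is_terminal = False

def build_game_tree(terminal_histories):
    root = TrieNode()            # the empty sequence, a member of H
    for z in terminal_histories:
        node = root
        for action in z:         # walk/extend the path for history z
            node = node.children.setdefault(action, TrieNode())
        node.is_terminal = True  # z is not a prefix of any other sequence
    return root

# Two terminal histories of the left subtree 103 in FIG. 1A:
tree = build_game_tree([
    ("J", "Q", "A1a", "A2b", "A3b"),   # h_153b
    ("J", "Q", "A1b", "A2c"),          # h_143c
])
```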

FIGS. 1A and 1B are diagrams illustrating examples of a game tree 100 and a public tree 150 of Kuhn Poker in accordance with embodiments of this specification. Kuhn Poker is an example of a zero-sum two-player IIG of poker. Kuhn Poker is an example of an extensive-form game. The game rules are defined as follows. The deck includes only three playing cards, for example, a King (K), a Queen (Q), and a Jack (J). One card is dealt to each player, and each player may place bets similarly to a standard poker game. If both players bet or both players pass, the player with the higher card wins; otherwise, the betting player wins.

A game tree is a directed graph. The nodes of the game tree represent positions (or states of a player) in a game. As shown in FIG. 1A, the root node 110 of the game tree 100 is represented by a circle, which is a chance node (also referred to as player 0). Each terminal node or leaf node (e.g., a terminal node 143a, 153a, 153b, 143c, 143d, 147a, 157a, 157b, 147c, or 147d) of the game tree 100 is represented by a diamond, indicating a terminal state which shows a payoff of the one or two players in the game. Each square (e.g., a non-terminal node 123, 127, 143b, or 147b) represents a state of player 1. Each triangle (e.g., a non-terminal node 133a, 133b, 137a, or 137b) represents a state of player 2. In some embodiments, $h_i$ represents a non-terminal node and $z_i$ represents a terminal node.

After each player is dealt a card, there are six different possible states. As shown by the six arrows out of the root node 110, the six different possible states are [J, Q], [J, K], [Q, J], [Q, K], [K, J], and [K, Q], indicating the cards received by player 1 and player 2, respectively. The game tree 100 shows subtrees 103 and 107 of two of the six possible states. The left subtree 103, corresponding to the state [J, Q], indicates that the two players (player 1 and player 2) are dealt J and Q, respectively. The right subtree 107, corresponding to the state [J, K], indicates that the two players (player 1 and player 2) are dealt J and K, respectively.

Arrows out of a node (or edges) of the game tree can represent possible actions of a player at that state of the game. As shown in FIG. 1A, the arrows out of the node 123 represent possible actions A_1a and A_1b of player 1 at the state of the node 123 corresponding to the state [J, Q]. Similarly, arrows out of the node 133a represent possible actions A_2a and A_2b of player 2 at the state of the node 133a corresponding to the state [J, Q, A_1a], where player 1 chooses A_1a. Arrows out of the node 133b represent possible actions A_2c and A_2d of player 2 at the state of the node 133b corresponding to the state [J, Q, A_1b].

The trajectory from the root node 110 to each node is a sequence or history of actions. For example, as illustrated in the subtree 103, the non-terminal node 143b corresponds to a sequence or history of actions (which can be denoted as h_143b) including the actions [J, Q, A_1a, A_2b]. The terminal node 153b corresponds to a sequence or history of actions (which can be denoted as h_153b) including the actions [J, Q, A_1a, A_2b, A_3b]. Since the node 153b is a terminal node, the sequence of actions [J, Q, A_1a, A_2b, A_3b] can be referred to as a terminal sequence of actions that leads to (or results in) the terminal state 153b. In the subtree 103, the node 143b is a prefix of the terminal node 153b. Similarly, the terminal node 143c corresponds to a sequence or history of actions (which can be denoted as h_143c) including the actions [J, Q, A_1b, A_2c].

In the IIG, the private card of player 1 is invisible to player 2. Similarly, the private card of player 2 is invisible to player 1. Therefore, the information available to player 1 at the node 123, corresponding to the state [J, Q], and at the node 127, corresponding to the state [J, K], is actually the same, because player 1 only knows his private card J and does not know whether the opponent's (player 2's) private card is Q or K. An information set $I_i$ can be used to denote the set of these undistinguished states. Let h_123 denote the state of the node 123 and $I_1(h_{123})$ denote the information set at the state of the node 123, and let h_127 denote the state of the node 127 and $I_1(h_{127})$ denote the information set at the state of the node 127. In this example, $I_1(h_{123})=I_1(h_{127})$. Typically, any $I_i \in I_i$ includes the information observed by player i, including player i's hidden variables (e.g., private cards) and public actions. In this example, $I_1(h_{123})=I_1(h_{127})=J$, which can be denoted as $I_{11}$.

Similarly, the information available to player 1 at the nodes corresponding to the states [Q, J] and [Q, K] is the same, which can be represented by the same information set $I_{12}$ that includes player 1's private card Q. The information available to player 1 at the nodes corresponding to the states [K, J] and [K, Q] is the same, which can be represented by the same information set $I_{13}$ that includes player 1's private card K.

FIG. 1B shows the public tree 150 corresponding to the game tree 100. Each node 125, 135a, 135b, 145a, 145b, 145c, 145d, 155a, or 155b in the public tree 150 can represent a public state that includes a sequence or history of public actions (also referred to as a public sequence). Each node corresponds to a vector of infosets $\vec{I}_i=[I_{i1}, I_{i2}, I_{i3}, \ldots]$. All infosets $I_{ij} \in \vec{I}_i$ indicate the same public sequence. $|\vec{I}_i|$ refers to the length of the vector. For example, as shown in FIG. 1B, the node 125 corresponds to the initial public sequence, which is empty (∅) in this example. The node 125 is associated with a vector of infosets of player 1, $\vec{I}_1=[I_{11}, I_{12}, I_{13}]$, corresponding to player 1's private card of J, Q, or K, respectively.

As another example, the node 135a can represent a public sequence that includes player 1's action [A_1a] and corresponds to a vector of infosets of player 2. Similarly, the node 135b can represent a public sequence that includes player 1's action [A_1b] and corresponds to another vector of infosets of player 2. The non-terminal node 145b corresponds to a public sequence that includes the public actions [A_1a, A_2b]. The terminal node 155b corresponds to a public sequence that includes the public actions [A_1a, A_2b, A_3b].

In some embodiments, the non-terminal node 145b in the public tree 150 can represent the common public state among the six different possible initial states of [J, Q], [J, K], [Q, J], [Q, K], [K, J], and [K, Q]. The common public state of the non-terminal node 145b includes a public sequence that includes the public actions [A_1a, A_2b], corresponding to a vector of infosets of player 1 at the non-terminal node 145b, $\vec{I}_1$(node 145b)=[I_11(node 145b), I_12(node 145b), I_13(node 145b)]. I_11(node 145b) can represent the information set of player 1 at the non-terminal node 145b that includes player 1's private card and the common public sequence that leads to the non-terminal node 145b. That is, I_11(node 145b)=[J, A_1a, A_2b]. Similarly, I_12(node 145b)=[Q, A_1a, A_2b], and I_13(node 145b)=[K, A_1a, A_2b]. The information set I_11(node 145b) is shared by the two nodes 143b and 147b in the game tree 100. The node 143b corresponds to a sequence of both private and public actions of all players in the game that leads to the node 143b. That is, h_143b=[J, Q, A_1a, A_2b]. Similarly, the node 147b corresponds to a sequence of both private and public actions of all players in the game that leads to the node 147b. That is, h_147b=[J, K, A_1a, A_2b]. As can be seen, h_143b and h_147b share the same information set I_11(node 145b)=[J, A_1a, A_2b].

In some embodiments, the strategy and Nash equilibrium of an IIG can be represented as follows. For a player i ∈ N, a strategy $\sigma_i(I_i)$ in an extensive-form game assigns an action distribution over $A(I_i)$ to infoset $I_i$. A strategy profile can be denoted as $\sigma=\{\sigma_i\,|\,\sigma_i\in\Sigma_i, i\in N\}$, where $\Sigma_i$ is the set of all possible strategies for player i. $\sigma_{-i}$ refers to all strategies in σ except for $\sigma_i$. $\sigma_i(I_i)$ is the strategy of infoset $I_i$. $\sigma_i(a|h)$ refers to the probability of action a taken by player i at state h. $\forall h_1, h_2 \in I_i$, $I_i(h_1)=I_i(h_2)$, $\sigma_i(h_1)=\sigma_i(h_2)$, and $\sigma_i(a|I_i)=\sigma_i(a|h_1)=\sigma_i(a|h_2)$. In some embodiments, the strategy $\sigma_i(I_i)$ specifies and comprises a respective probability $\sigma_i(a|h)$ of selecting an action a among the plurality of possible actions in the state h under the strategy $\sigma_i(I_i)$. For example, for player 1 at the node 123 of the game tree 100 in FIG. 1A, the strategy $\sigma_1(I_1)$ can include a probability $\sigma_1$(A_1a|node 123) of selecting the action A_1a among the two possible actions A_1a and A_1b in the state of the node 123, and a probability $\sigma_1$(A_1b|node 123) of selecting the action A_1b among the two possible actions A_1a and A_1b in the state of the node 123. If the strategy $\sigma_1(I_1)$ is uniform (e.g., an initial strategy), the probability $\sigma_1$(A_1a|node 123)=0.5 and the probability $\sigma_1$(A_1b|node 123)=0.5. In some embodiments, the strategy $\sigma_1(I_1)$ can be updated in each iteration of the CFR so that, when the CFR converges, a player can approach the Nash equilibrium (or approximate Nash equilibrium) if the player selects the actions at state h, or given the information set I, following the probabilities specified in the strategy $\sigma_1(I_1)$. For example, if the strategy $\sigma_1(I_1)$ output by the CFR specifies the probability $\sigma_1$(A_1a|node 123)=0.2 and the probability $\sigma_1$(A_1b|node 123)=0.8, the player can select the action A_1b with a probability of 0.8 at state h, or given the information set I, to approach the Nash equilibrium (or approximate Nash equilibrium).

For iterative learning methods such as CFR, $\sigma^t$ refers to the strategy profile at the t-th iteration. $\pi^{\sigma}(h)$ refers to the state reach probability (also called the state range), which is the product of the strategy probabilities of all players (including chance, such as the root node 110 in the game tree 100) along the history h. For an empty sequence, $\pi^{\sigma}(\emptyset)=1$.

In some embodiments, the reach probability can be decomposed into

$$\pi^{\sigma}(h)=\prod_{i\in N\cup\{c\}}\pi_i^{\sigma}(h)=\pi_i^{\sigma}(h)\,\pi_{-i}^{\sigma}(h), \qquad (1)$$

where $\pi_i^{\sigma}(h)$ is the product of player i's strategy probabilities and $\pi_{-i}^{\sigma}(h)$ is the product of the strategy probabilities of all players except player i, denoted as $\sigma_{-i}$. $\forall h\in I_i$, $\pi_i^{\sigma}(h)=\pi_i^{\sigma}(I_i)$.

For two histories h₁ and h₂ with $h_1\sqsubseteq h_2$, $\pi^{\sigma}(h_1,h_2)$ refers to the product of all players' strategy probabilities from history h₁ to h₂. $\pi_i^{\sigma}(h_1,h_2)$ and $\pi_{-i}^{\sigma}(h_1,h_2)$ can be defined in a similar way. The infoset reach probability (infoset range) of $I_i$ can be defined by $\pi_i^{\sigma}(I_i)=\sum_{h\in I_i}\pi_i^{\sigma}(h)$. Similarly, $\pi_{-i}^{\sigma}(I_i)=\sum_{h\in I_i}\pi_{-i}^{\sigma}(h)$.
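
As a hedged illustration of the decomposition in Eq. (1) and the reach probabilities above, the following Python sketch accumulates $\pi_i^{\sigma}(h)$ and $\pi_{-i}^{\sigma}(h)$ along a history; the data layout (a list of (player, infoset, action) triples and a strategy dictionary) is an assumption made for this example.

```python
# Illustrative sketch of Eq. (1): pi(h) = pi_i(h) * pi_{-i}(h).
# `history` is a list of (player, infoset, action) triples; `strategy`
# maps (player, infoset, action) -> probability. Both are assumptions.

def reach_probabilities(history, strategy, player_i):
    pi_i, pi_minus_i = 1.0, 1.0
    for player, infoset, action in history:
        p = strategy[(player, infoset, action)]
        if player == player_i:
            pi_i *= p            # player i's own contribution
        else:
            pi_minus_i *= p      # opponents' and chance's contribution
    return pi_i, pi_minus_i

# pi(h) is then the product of the two returned factors.
```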

For player i, the expected game utility can be computed by $u_i^{\sigma}=\sum_{z\in Z}\pi^{\sigma}(z)\,u_i(z)$. Given a fixed strategy profile $\sigma_{-i}$, a best response is denoted as

$$br(\sigma_{-i})=\underset{\sigma_i'\in\Sigma_i}{\operatorname{argmax}}\; u_i^{(\sigma_i',\sigma_{-i})}. \qquad (2)$$

An ε-Nash equilibrium is an approximated Nash equilibrium whose strategy profile $\sigma^*=(br(\sigma_{-i}),br(\sigma_i))$ satisfies:

$$\forall i\in N,\; u_i^{(br(\sigma_{-i}),\,\sigma_{-i})}+\epsilon\ge\max_{\sigma_i'\in\Sigma_i} u_i^{(\sigma_i',\,\sigma_{-i})}. \qquad (3)$$

The exploitability of a strategy $\sigma_i$ can be defined as $\epsilon_i(\sigma_i)=u_i^{\sigma^*}-u_i^{(\sigma_i,\,br(\sigma_{-i}))}$. A strategy is unexploitable if $\epsilon_i(\sigma_i)=0$. In large two-player zero-sum games such as poker, $u_i^{\sigma^*}$ can be intractable to compute. However, if the players alternate their positions, the value of a pair of games is zero, i.e., $u_1^{\sigma^*}+u_2^{\sigma^*}=0$. The exploitability of a strategy profile σ can be defined as $\epsilon(\sigma)=\left(u_2^{(\sigma_1,\,br(\sigma_1))}+u_1^{(br(\sigma_2),\,\sigma_2)}\right)/2$.

CFR is an iterative method for finding a Nash equilibrium in zero-sum perfect-recall imperfect information games. A counterfactual value $v_i^{\sigma^t}(I_i)$ can be computed to represent an expected utility for player i at the information set $I_i$ under the current strategy profile $\sigma^t$, assuming that player i plays to reach $I_i$. In some embodiments, given $\sigma^t$, the counterfactual value $v_i^{\sigma^t}(I_i)$ can be computed by

$$v_i^{\sigma^t}(I_i)=\sum_{h\in I_i}\pi_{-i}^{\sigma^t}(h)\sum_{h\sqsubseteq z,\,z\in Z}\pi^{\sigma^t}(h,z)\,u_i(z)=\Pi_{-i}^{\sigma^t}(I_i)\,U_i^{\sigma^t}[I_i], \qquad (4)$$

where $\Pi_{-i}^{\sigma^t}(I_i)\in\mathbb{R}^{1\times d}$ is the opponent's range matrix (i.e., the reach probability of the opponent), $U_i^{\sigma^t}[I_i]\in\mathbb{R}^{d\times 1}$ is the expected utility value matrix of player i given the information set $I_i$, and d refers to the dimension.

In some embodiments, $\Pi_{-i}^{\sigma^t}(I_i)$ can be computed as the product of the strategies of all players except player i along the history $h \in I_i$, representing a posterior probability of the opponent's actions given that player i reaches the current information set $I_i$ under the current strategy profile $\sigma^t$. $U_i^{\sigma^t}[I_i]$ can represent the expected utility value matrix given that player i reaches the current information set $I_i$ under the current strategy profile $\sigma^t$, assuming a uniform distribution of the opponent's private actions.

For example, with respect to Kuhn Poker in FIGS. 1A and 1B, $U_i^{\sigma^t}[I_i]$ can represent the expected utility values of player 1 when player 2 is dealt a private card of J, Q, or K with a uniform distribution, respectively, while $\Pi_{-i}^{\sigma^t}(I_i)$ can be a vector of the probabilities that player 2 is dealt a private card of J, Q, or K, respectively, given that player 1 reaches the current information set $I_i$ under the current strategy profile $\sigma^t$.

As another example, in heads-up no-limit Texas hold'em poker (HUNL), each entry in $\Pi_{-i}^{\sigma^t}(I_i)$ refers to the opponent's range when dealt a particular pair of private cards. Each entry in $U_i^{\sigma^t}[I_i]$ refers to the expected utility value given the two players' private cards and current strategies.
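
The matrix form of Eq. (4) can be illustrated with the following minimal numerical sketch; the dimension d=3 (e.g., the three private cards J, Q, and K of Kuhn Poker) and the numbers are made up for illustration.

```python
# Illustrative sketch of Eq. (4): the CFV is the product of the
# opponent's range matrix (1 x d) and the expected utility matrix (d x 1).
import numpy as np

d = 3                                           # e.g., J, Q, K in Kuhn Poker
opponent_range = np.array([[0.2, 0.5, 0.3]])    # Pi_{-i}(I_i), shape 1 x d
expected_utility = np.array([[1.0],
                             [-1.0],
                             [2.0]])            # U_i[I_i], shape d x 1

cfv = opponent_range @ expected_utility         # v_i(I_i), shape 1 x 1
print(float(cfv[0, 0]))                         # 0.2*1.0 - 0.5*1.0 + 0.3*2.0 = 0.3
```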

$v_i^{\sigma^t}(a|I_i)$ refers to the counterfactual value of action a, and its regret can be computed by:

$$r_i^{\sigma^t}(a|I_i)=v_i^{\sigma^t}(a|I_i)-v_i^{\sigma^t}(I_i). \qquad (5)$$

The cumulative regret of action a after t iterations is

$$R_i^{t}(a|I_i)=R_i^{t-1}(a|I_i)+r_i^{\sigma^t}(a|I_i), \qquad (6)$$

where $R_i^{0}(a|I_i)=0$.

Defining $R_i^{t,+}(a|I_i)=\max(R_i^{t}(a|I_i),0)$, the current strategy at iteration t+1 can be computed based on regret matching according to:

$$\sigma_i^{t+1}(a|I_i)=\begin{cases}\dfrac{1}{|A(I_i)|}, & \text{if }\sum_{a\in A(I_i)}R_i^{t,+}(a|I_i)=0,\\[6pt]\dfrac{R_i^{t,+}(a|I_i)}{\sum_{a\in A(I_i)}R_i^{t,+}(a|I_i)}, & \text{otherwise.}\end{cases} \qquad (7)$$

The average strategy $\bar{\sigma}_i^{T}$ after T iterations can be computed by

$$\bar{\sigma}_i^{T}(a|I_i)=\frac{\sum_{t=1}^{T}\pi_i^{\sigma^t}(I_i)\,\sigma_i^{t}(a|I_i)}{\sum_{t=1}^{T}\sum_{a\in A(I_i)}\pi_i^{\sigma^t}(I_i)\,\sigma_i^{t}(a|I_i)}. \qquad (8)$$
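
As a hedged sketch, the regret-matching update of Eq. (7) and the reach-weighted average strategy of Eq. (8) can be implemented as follows; the function names and array layout are illustrative, not part of this specification.

```python
# Illustrative sketch of regret matching (Eq. (7)) and the reach-weighted
# average strategy (Eq. (8)). Names and shapes are assumptions.
import numpy as np

def regret_matching(cumulative_regret):
    """Next-iteration strategy sigma^{t+1}(.|I_i) from R^t(.|I_i)."""
    positive = np.maximum(cumulative_regret, 0.0)        # R^{t,+}
    total = positive.sum()
    if total == 0.0:                                     # uniform fallback
        return np.full(len(cumulative_regret), 1.0 / len(cumulative_regret))
    return positive / total

def average_strategy(reach_weights, strategies):
    """Eq. (8): reach_weights[t] = pi_i^{sigma^t}(I_i), and
    strategies[t] = sigma_i^t(.|I_i) as a probability vector."""
    weighted = sum(w * s for w, s in zip(reach_weights, strategies))
    return weighted / weighted.sum()

print(regret_matching(np.array([2.0, -1.0, 1.0])))       # [2/3, 0, 1/3]
```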

CFR+ is similar to CFR, except that CFR+ replaces regret matching with regret matching+ and uses a weighted average strategy. CFR and CFR+ are proven to approach Nash equilibria after enough iterations. The best known theoretical bound for CFR and CFR+ to converge to equilibrium is $O\!\left(\frac{1}{\epsilon^{2}}\right)$. This bound is slower than that of first-order methods, which converge at rate $O\!\left(\frac{1}{\epsilon}\right)$. However, CFR+ empirically converges much faster than $O\!\left(\frac{1}{\epsilon}\right)$ in many games.

MCCFR computes an unbiased estimate of the counterfactual value by sampling subsets of infosets in each iteration. Define $Q=\{Q_1, Q_2, \ldots, Q_m\}$, where $Q_j \subseteq Z$ is a set (block) of sampled terminal histories generated by MCCFR, such that the blocks together span the set Z. Define $q_{Q_j}$ as the probability of considering block $Q_j$, where $\sum_{j=1}^{m}q_{Q_j}=1$. Define $q(z)=\sum_{j:z\in Q_j}q_{Q_j}$ as the probability of considering a particular terminal history z. The particular terminal history z corresponds to a sequence of actions (also referred to as a terminal sequence of actions) that includes the actions taken by all players that result in the terminal history z. In some embodiments, the probability of considering a particular terminal history z is the probability that the particular terminal history z is sampled (also referred to as the probability of a sampled terminal sequence of actions). In some embodiments, the probability of a sampled terminal history z, or the probability of a sampled terminal sequence of actions, can be computed based on the sampling probabilities of all actions included in the sampled terminal sequence of actions that leads to the sampled terminal history z. For example, if the sampled terminal sequence of actions that leads to the sampled terminal history z includes the sequence of actions [A₁, A₂, . . . , A_m], q(z) can be computed as a product of the respective sampling probabilities of all the actions in the sampled terminal sequence of actions [A₁, A₂, . . . , A_m].

The estimate of the sampled counterfactual value (also referred to as the estimated counterfactual value) of $I_i$ can be computed by:

$$\tilde{v}_i^{\sigma}(I_i|Q_j)=\sum_{h\in I_i,\,z\in Q_j,\,h\sqsubseteq z}\frac{1}{q(z)}\,\pi_{-i}^{\sigma}(z)\,\pi_i^{\sigma}(h,z)\,u_i(z), \qquad (9)$$

where 1/q(z) can represent the importance of the particular sampled terminal history z in calculating the sampled counterfactual value $\tilde{v}_i^{\sigma}(I_i|Q_j)$.
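
The following hedged sketch computes q(z) as the product of the per-action sampling probabilities of a sampled terminal sequence, as described above, together with one summand of Eq. (9); all names and numbers are illustrative.

```python
# Illustrative sketch: q(z) for a sampled terminal sequence [A1, ..., Am]
# and one summand of the sampled-CFV estimate in Eq. (9).

def q_of_z(action_sampling_probs):
    q = 1.0
    for p in action_sampling_probs:
        q *= p                    # product of per-action sampling probs
    return q

def sampled_cfv_summand(action_sampling_probs, pi_minus_i_z, pi_i_h_to_z, payoff):
    """(1/q(z)) * pi_{-i}(z) * pi_i(h, z) * u_i(z) for one (h, z) pair."""
    return pi_minus_i_z * pi_i_h_to_z * payoff / q_of_z(action_sampling_probs)

print(sampled_cfv_summand([0.5, 0.5], 0.25, 1.0, 2.0))   # 0.25*1.0*2.0/0.25 = 2.0
```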

Define $\sigma^{s}$ as the sampled strategy profile, where $\sigma_i^{s}$ is the sampled strategy of player i and $\sigma_{-i}^{s}$ are those of the other players except player i. The regret of the sampled action $a \in A(I_i)$ can be computed by:

$$\tilde{r}_i^{\sigma}(I_i,a|Q_j)=\tilde{v}_i^{\sigma}(I_i,a|Q_j)-\tilde{v}_i^{\sigma}(I_i|Q_j), \qquad (10)$$

where

$$\tilde{v}_i^{\sigma}(I_i,a|Q_j)=\sum_{z\in Q_j,\,ha\sqsubseteq z,\,h\in I_i}\pi_i^{\sigma}(ha,z)\,u_i^{s}(z), \qquad (11)$$

where $u_i^{s}(z)=u_i(z)/\pi_i^{\sigma^{s}}(z)$ is the utility weighted by $1/\pi_i^{\sigma^{s}}(z)$.

The estimated cumulative regret of action a after t iterations is

$$\tilde{R}_i^{t}(I_i,a|Q_j)=\tilde{R}_i^{t-1}(I_i,a|Q_j)+\tilde{r}_i^{\sigma^t}(I_i,a|Q_j), \qquad (12)$$

where $\tilde{R}_i^{0}(I_i,a|Q_j)=0$.

The current strategy at iteration t+1 can be computed based on regret matching according to Eq. (7), or regret matching+, similar to the original CFR. Similarly, the average strategy $\bar{\sigma}_i^{T}$ after T iterations can be computed according to Eq. (8).

MCCFR provably maintains an unbiased estimate of the counterfactual value and converges to Nash equilibrium. Outcome sampling and external sampling are two popular sampling methods. The original outcome sampling chooses one history according to the two players' current strategies (or ε-greedy). External sampling is very similar to outcome sampling, except that one player takes all actions at her decision node. In each iteration, the classical MCCFR designates one player as the traverser, whose cumulative regret and strategy will be updated in that iteration. After that, the other player is designated as the traverser. Another sampling method, robust sampling, has been proposed, in which the traverser samples k actions and the opponent samples one action. In the robust sampling scheme, the traverser uses a uniform sampling method to sample at a current decision point, while the other party samples according to a corresponding strategy. The reach probability corresponding to different iterations can be fixed. It can be proved that the robust sampling scheme has a smaller variance than the outcome sampling scheme in MCCFR, while being more memory-efficient than external sampling. In some embodiments, the robust sampling scheme can make MCCFR solve for Nash equilibrium (including approximated Nash equilibrium) with faster convergence.
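
A minimal sketch of the robust sampling scheme described above might look as follows, with the traverser sampling k actions uniformly and the opponent sampling one action from her current strategy; the function names are hypothetical.

```python
# Illustrative sketch of robust sampling: the traverser samples k actions
# uniformly (without replacement); the opponent samples one action from
# her current strategy. Function names are hypothetical.
import random

def sample_traverser_actions(actions, k):
    return random.sample(actions, min(k, len(actions)))   # uniform, k actions

def sample_opponent_action(actions, strategy):
    weights = [strategy[a] for a in actions]              # sigma(.|I)
    return random.choices(actions, weights=weights)[0]    # one action

print(sample_traverser_actions(["bet", "pass"], k=1))
print(sample_opponent_action(["bet", "pass"], {"bet": 0.7, "pass": 0.3}))
```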

MCCFR and its variants can be classified into three types: value-form, semi-vector-form, and vector-form MCCFR. For a clear explanation, these three types of MCCFR forms are explained as applied to Kuhn Poker as shown in FIGS. 1A-1B. Here, robust sampling is used as the default sampling method and player p1 is the traverser. At each decision node, p1 samples one action according to a uniform random policy and p2 samples one action according to p2's current strategy.

Value-form MCCFR: At the start of each iteration, p1 and p2 are each dealt one private card, such as J for p1 and Q for p2, as shown in the left subtree 103. Then they play against each other until the end. In a perfect-recall two-player imperfect information game, given the public sequence and p2's private card, a particular infoset $I_2 \in I_2$ can be determined. p2 samples one action according to $\sigma_2(I_2)$. In this scenario, the value-form MCCFR generates one history h in each iteration. The value of the terminal node is the game payoff.

Semi-vector-form MCCFR: Suppose p2 is dealt the private card Q and p1 is dealt a vector of private cards [J, K]. Similar to the value-form MCCFR, these two players play against each other until the end. p1's decision node maintains a vector of infosets $\vec{I}_1=[I_{11}, I_{12}]$ and p2's node maintains one infoset $I_2$. Also, $\vec{I}_1$ indicates a vector of policies $\vec{\sigma}_1=[\sigma_{11}, \sigma_{12}]$. In this scenario, p2 samples one action according to $\sigma_2(I_2)$. When using robust sampling, p1 samples her actions according to a uniform random policy rather than the vector of policies $\vec{\sigma}_1$, so that it is unnecessary to specify a particular current strategy as the sampling policy. Semi-vector-form MCCFR updates a vector of the traverser's regrets and strategies in each iteration. It is expected that semi-vector-form MCCFR can benefit from efficient matrix manipulation and empirically converges faster than value-form MCCFR.

Vector-form MCCFR: This method does not need to specify private cards for p1 or p2. As shown in FIG. 1B, the decision node of player i ∈ [1, 2] (e.g., the non-terminal node 125, 135a, or 135b) maintains a vector of infosets $\vec{I}_i=[I_{i1}, I_{i2}, I_{i3}]$. In each iteration, the vector-form MCCFR generates a vector of sequences along the public tree 150 (e.g., from the node 125 to a terminal node such as the node 155b following the public sequence [A_1a, A_2b, A_3b]).

Because each decision node $\vec{I}_i$ indicates a vector of current strategies $\vec{\sigma}_i=[\sigma_{i1},\sigma_{i2},\sigma_{i3}]$, a sampling policy needs to be determined, given the multiple current strategies in the vector $\vec{\sigma}_i$, to sample an action out of the possible actions of player i at the decision node $\vec{I}_i$. Rather than using a uniform sampling policy in which each infoset in $\vec{I}_i$ shares the same uniform policy, several non-uniform sampling policies are described below. In some embodiments, these non-uniform sampling policies can pay more attention to the relatively important actions and also achieve better long-term performance.

Random Current Strategy (RCS): When using RCS, player i randomly selects one infoset $I_i$ from $\vec{I}_i$ and samples one action according to $\sigma_i^{t}(I_i)$.

Mean Current Strategy (MCS): This sampling policy is the mean of the current strategies over all the infosets in $\vec{I}_i$, which can be computed by

$$\sigma_i^{mcs}(a|\vec{I}_i)=\frac{\sum_{I_i\in\vec{I}_i}\sigma_i^{t}(a|I_i)}{\sum_{I_i\in\vec{I}_i}\sum_{a\in A(I_i)}\sigma_i^{t}(a|I_i)}=\frac{\sum_{I_i\in\vec{I}_i}\sigma_i^{t}(a|I_i)}{|\vec{I}_i|}. \qquad (13)$$

The MCS gives the different infosets $\{I_i\}$ in $\vec{I}_i$ the same weight.

Weighted Current Strategy (WCS): In the field of game theory, a player typically has a very low probability of taking a disadvantageous action. Typically, players make different decisions in different situations. For example, a player may need to take a more aggressive strategy after beneficial public cards are revealed in a poker game. Accordingly, in WCS, on top of the average strategy in Eq. (8), different infosets $\{I_i\}$ in $\vec{I}_i$ can be weighted differently. For example, the infoset $I_i$ can be weighted by player i's range. In this case, the WCS sampling policy can be defined by

$$\sigma_i^{wcs}(a|\vec{I}_i)=\frac{\sum_{I_i\in\vec{I}_i}\pi_i^{\sigma^t}(I_i)\,\sigma_i^{t}(a|I_i)}{\sum_{I_i\in\vec{I}_i}\sum_{a\in A(I_i)}\pi_i^{\sigma^t}(I_i)\,\sigma_i^{t}(a|I_i)}. \qquad (14)$$

In some embodiments, the WCS sampling strategies can include other versions, for example, by applying different or additional weights. For example, player i's own range $\pi_i^{\sigma^t}(I_i)$ in Eq. (14) can be replaced by the opponent's range $\pi_{-i}^{\sigma^t}(I_i)$ or both players' ranges $\pi^{\sigma^t}(I_i)$. In many settings, the above-mentioned WCS sampling strategies can approach Nash equilibrium efficiently.

Weighted Average Strategy (WAS): In WAS, the current strategy in Eq. (13) and Eq. (14) can be replaced by the average strategy within t iterations as an approximation of Nash equilibrium. For example, by replacing the current strategy $\sigma^{t}$ in Eq. (14) by the average strategy $\bar{\sigma}^{t}$, the weighted average strategy can be defined by

$$\sigma_i^{was}(a|\vec{I}_i)=\frac{\sum_{I_i\in\vec{I}_i}\pi_i^{\sigma^t}(I_i)\,\bar{\sigma}_i^{t}(a|I_i)}{\sum_{I_i\in\vec{I}_i}\sum_{a\in A(I_i)}\pi_i^{\sigma^t}(I_i)\,\bar{\sigma}_i^{t}(a|I_i)}. \qquad (15)$$

In some embodiments, $\pi_i^{\sigma^t}(I_i)$ rather than $\pi_i^{\sigma^{t-1}}(I_i)$ can be used as the weight of each infoset in Eq. (15), because Eq. (8) and Eq. (15) share the same weight.
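
As a hedged illustration, the MCS policy of Eq. (13) and the WCS policy of Eq. (14) can be computed from a matrix of current strategies as follows; the array layout (one strategy row per infoset in the vector) is an assumption made for this example.

```python
# Illustrative sketch of Eq. (13) (MCS) and Eq. (14) (WCS). `strategies`
# is a (|vec I_i| x |A|) matrix: one current-strategy row per infoset in
# the vector; `ranges` holds pi_i^{sigma^t}(I) per infoset. Layout assumed.
import numpy as np

def mcs_policy(strategies):
    policy = strategies.mean(axis=0)                      # mean over infosets, Eq. (13)
    return policy / policy.sum()

def wcs_policy(strategies, ranges):
    policy = (ranges[:, None] * strategies).sum(axis=0)   # range-weighted, Eq. (14)
    return policy / policy.sum()

strategies = np.array([[0.9, 0.1],                        # sigma^t(.|I_11)
                       [0.5, 0.5],                        # sigma^t(.|I_12)
                       [0.1, 0.9]])                       # sigma^t(.|I_13)
ranges = np.array([0.6, 0.3, 0.1])
print(mcs_policy(strategies), wcs_policy(strategies, ranges))
```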

MCCFR learns the state-action policy from sampled experience. Variance reduction techniques used in Monte Carlo methods can be applied to MCCFR. For example, control variates are a variance reduction technique in which one can lower the variance of a random variable by subtracting another random variable from it and adding the expectation of the latter. A baseline can be used in variance reduction techniques. A baseline allows increasing or decreasing the log probability of actions based on whether they perform better or worse than the average performance when starting from the same state. In some embodiments, to reduce the variance, a particular baseline can be specified for each counterfactual value. In some embodiments, the baseline can be a scalar. In some embodiments, the baseline-corrected CFV can be the original CFV minus the specified baseline.

In some embodiments, rather than using an expected utility value, a counterfactual value can be used as the baseline (referred to as a counterfactual value baseline) in variance reduction techniques applied to MCCFR. The variance reduction based on the counterfactual value baseline is proved to be unbiased and can be more computationally efficient than variance reduction based on an expected utility value baseline.

In the variance reduction with a counterfactual value baseline, the estimated counterfactual value can be defined recursively. $Q_j$ refers to the sampled block, and $I_i$ refers to the sampled infoset that holds $h \in I_i$, $h \sqsubseteq z$, $z \in Q_j$.

Define $b_i^{t-1}(a|I_i)$ as the state-action baseline on iteration t−1, $\sigma_i^{se,t}$ as player i's sampling policy, and $q(I_i)$ as the probability of sampling $I_i$.

In a vector-form MCCFR, $\forall h \in I_i$, $q(h)=q(I_i)$. The estimated state-action baseline on iteration t−1 can be computed as:

$$\tilde{b}_i^{t-1}(a|I_i)=b_i^{t-1}(a|I_i)\,\sigma_i^{se,t}(a|I_i)/q(I_i). \qquad (16)$$

Given the estimated counterfactual value $\tilde{v}_i^{\sigma^t}(I_i,a|Q_j)$ for action a at infoset $I_i$, the baseline-corrected (or baseline-enhanced) value $\hat{v}_i^{\sigma^t}(I_i,a|Q_j)$ for action a can be computed by:

$$\hat{v}_i^{\sigma^t}(I_i,a|Q_j)=\begin{cases}\tilde{b}_i^{t-1}(a|I_i), & \text{if } a \text{ is not sampled,}\\[6pt]\tilde{v}_i^{\sigma^t}(I_i,a|Q_j)+\dfrac{\left(\sigma_i^{se,t}(a|I_i)-1\right)\tilde{b}_i^{t-1}(a|I_i)}{\sigma_i^{se,t}(a|I_i)}, & \text{otherwise.}\end{cases} \qquad (17)$$

The estimated counterfactual value $\tilde{v}_i^{\sigma^t}(I_i|Q_j)$ for infoset $I_i$ can be computed by

$$\tilde{v}_i^{\sigma^t}(I_i|Q_j)=\begin{cases}\sum_{z\in I_i,\,z\in Q_j}\dfrac{1}{q(z)}\,\pi_{-i}^{\sigma^t}(z)\,u_i(z), & \text{if } I_i \text{ is terminal,}\\[6pt]\sum_{a\in A(I_i)}\hat{v}_i^{\sigma^t}(I_i,a|Q_j)\,\sigma_i^{t}(a|I_i), & \text{otherwise,}\end{cases} \qquad (18)$$

where the non-terminal case is a sum of the baseline-corrected values of Eq. (17) weighted by the current strategy.

Define $b_i^{0}(a|I_i)=0$. Two example methods can be used to update the baseline. In the first method, the baseline can be updated based on the estimated counterfactual value, as formulated by

$$b_i^{t}(a|I_i)=\begin{cases}b_i^{t-1}(a|I_i), & \text{if } I_i \text{ is not sampled,}\\[6pt](1-\gamma)\,b_i^{t-1}(a|I_i)+\gamma\,\tilde{v}_i^{\sigma^t}(I_i,a|Q_j)\,q(I_i), & \text{otherwise.}\end{cases} \qquad (19)$$

In the second method, the baseline is updated based on the baseline-corrected value $\hat{v}_i^{\sigma^t}(I_i,a|Q_j)$ as shown in Eq. (17), rather than the estimated counterfactual value $\tilde{v}_i^{\sigma^t}(I_i,a|Q_j)$. In other words, the $\tilde{v}_i^{\sigma^t}(I_i,a|Q_j)$ in Eq. (19) is replaced with the $\hat{v}_i^{\sigma^t}(I_i,a|Q_j)$ computed based on Eq. (17). The second baseline is also referred to as a bootstrapping baseline.
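
For illustration, the baseline correction of Eq. (17) and the first baseline-update rule of Eq. (19) can be sketched per action as follows; all arguments are scalars and the names are assumptions.

```python
# Illustrative per-action sketch of Eq. (17) (baseline correction) and the
# first update rule of Eq. (19). All names are assumptions.

def baseline_corrected_value(v_tilde, b_tilde, sample_prob, was_sampled):
    """Eq. (17): unsampled actions fall back to the estimated baseline."""
    if not was_sampled:
        return b_tilde
    return v_tilde + (sample_prob - 1.0) * b_tilde / sample_prob

def update_baseline(b_prev, v_tilde, q_infoset, gamma, was_sampled):
    """Eq. (19): decay the old baseline toward the new estimated CFV."""
    if not was_sampled:
        return b_prev
    return (1.0 - gamma) * b_prev + gamma * v_tilde * q_infoset

print(baseline_corrected_value(v_tilde=1.0, b_tilde=0.4, sample_prob=0.5,
                               was_sampled=True))    # 1.0 - 0.4 = 0.6
```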

The cumulative regret and the average strategy can be computed following formulations similar to those of the original MCCFR, for example, according to Eqs. (12) and (8), respectively. The estimated counterfactual value and the baseline can be updated recursively for all the infosets along the sampled sequence.

As an example of implementing the variance reduction with a counterfactual value baseline, the following steps can be performed in each iteration of the MCCFR.

(a) Compute a CFV for a terminal node of a game tree or a public tree according to the upper (or first) equation of Eq. (18). In some embodiments, for a value-form implementation, the computation of the CFV for a terminal node of a public tree according to Eq. (18) can be implemented as a matrix (or vector) product of a 1×d matrix and a d×1 matrix, similar to Eq. (4). In some embodiments, for a vector-form implementation, the computation of the CFV for a terminal node of a public tree according to Eq. (18) can be implemented as a matrix product of a d×d matrix and a d×d matrix. In some embodiments, the computation of the CFV based on the opponent's range matrix and the expected utility value matrix only needs to be performed once per public sequence for the terminal nodes of that public sequence. The CFV of a non-terminal node can be computed based on a summation of the weighted CFVs of the child nodes of the non-terminal node, for example, according to the lower (or second) equation of Eq. (18).

(b) Compute a baseline-corrected CFV according to Eq. (17). In a vector-form implementation, because the baseline is a CFV baseline, this step may only need two d×1 matrix additions, as shown in the lower equation of Eq. (17), rather than further operations based on an expected utility baseline.

(c) Compute a CFV for each non-terminal node according to the lower (or second) equation of Eq. (18). This step includes a summation of the weighted CFVs of the child nodes. In a vector-form implementation, the resulting CFV has a dimension of d×1.

(d) Update the baseline according to Eq. (19). This step includes a weighted average of CFVs using a decay factor γ and the probability of considering the non-terminal state, q(I_(i)), which can be computed as a product of the sampling probabilities of the sequence of actions that leads to I_(i). In a vector-form implementation, the resulting updated baseline has a dimension of d×1.

(e) Recursively compute (b)-(d) along the game tree or the public tree until reaching the root node in the current iteration. The computed baseline-corrected CFV of each node can be used to compute the regret, the cumulative regret, the current strategy, and the average strategy, following formulations similar to those of the original MCCFR, for example, according to Eqs. (10), (12), (7), and (8), respectively.

The above steps can be repeated in each iteration until a convergence condition is reached. The current strategy or the average strategy after reaching convergence can be returned as an output of the MCCFR to approximate the Nash equilibrium.
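
Steps (a)-(e) can be illustrated with the following compact sketch (in Python). It is a simplified, single-trajectory illustration under assumptions stated in the comments (a toy node structure, a sampling policy equal to the current strategy, and a constant opponent reach probability); it sketches the recursion and is not the full vector-form implementation.

    import numpy as np

    GAMMA = 0.5  # decay factor for the baseline update in Eq. (19)

    class Node:
        # Toy public-tree node; an assumption of this sketch only.
        def __init__(self, children=(), payoff=0.0):
            self.children = list(children)        # empty list => terminal
            self.payoff = payoff                  # u_i(z) at a terminal node
            n = len(self.children)
            self.strategy = np.full(n, 1.0 / n) if n else None  # sigma^t
            self.baseline = np.zeros(n)           # b^(t-1)(.|I_i)

    def iterate(node, opp_reach=1.0, q=1.0):
        # One pass of steps (a)-(e); returns a baseline-corrected CFV.
        if not node.children:
            return opp_reach * node.payoff / q    # (a) upper case of Eq. (18)
        sigma_se = node.strategy                  # assume sampling = current
        a = np.random.choice(len(node.children), p=sigma_se)
        v_child = iterate(node.children[a], opp_reach, q * sigma_se[a])
        b_tilde = node.baseline * sigma_se / q    # Eq. (16)
        v_hat = b_tilde.copy()                    # (b) Eq. (17), unsampled case
        v_hat[a] = v_child + (sigma_se[a] - 1.0) * b_tilde[a] / sigma_se[a]
        v_node = float(np.dot(node.strategy, v_hat))   # (c) lower Eq. (18)
        node.baseline[a] = ((1 - GAMMA) * node.baseline[a]
                            + GAMMA * v_hat[a] * q)    # (d) Eq. (19)
        return v_node                             # (e) recursion to the root

    root = Node(children=[Node(payoff=1.0), Node(payoff=-1.0)])
    print(iterate(root))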

It can be proven that the variance reduction with a counterfactual value baseline maintains an unbiased estimate of the counterfactual value. That is, if the baseline-corrected counterfactual values are defined by Eq. (17) and Eq. (18), then ∀i ∈ N, I_(i) ∈ ℐ_(i), a ∈ A(I_(i)), and σ^(t), it holds that E_(z)[v̂_(i)^(σ^(t))(I_(i), a|Q_(j))] = v_(i)^(σ^(t))(a|I_(i)).

In some embodiments, the variance reduction techniques based on the counterfactual value baseline require less computation than the ones based on the expected utility value baseline. For example, according to Eq. (4), the counterfactual value v_(i)^(σ^(t))(I_(i)) can be computed as the multiplication of the opponent's range matrix Π_(−i)^(σ^(t))(I_(i)) ∈ ℝ^(1×d) and the expected utility value matrix U_(i)^(σ^(t))[I_(i)] ∈ ℝ^(d×1). When using vector-form MCCFR, the variance reduction techniques based on the expected utility value baseline maintain a d×d matrix as the baseline and use this baseline in a control variate to update the baseline-corrected expected utility value, which is also a d×d matrix. After that, the estimated counterfactual value is the multiplication of the opponent's range matrix (a 1×d matrix), the baseline-enhanced expected utility value (a d×d matrix), and the scalar 1/q(z) (a 1×1 matrix).

Different from the expected utility value baseline, the variance reduction techniques with the counterfactual value baseline are more computationally efficient. In some embodiments, the counterfactual value of the vector of information sets I⃗_(i) is a 1×d matrix. As defined in Eq. (19), the counterfactual value baseline is updated on counterfactual values, and the baseline corresponding to I⃗_(i) is a 1×d matrix. Eq. (17) and Eq. (18) are summations or aggregations over several 1×d matrices corresponding to I⃗_(i). For non-terminal states, the counterfactual values or baseline-corrected counterfactual values can be updated based on the summation or aggregation as shown in the lower (or second) equation of Eq. (18). By contrast, for the variance reduction techniques with the expected utility value baseline, the counterfactual values or baseline-corrected counterfactual values are updated based on multiplication (e.g., as shown in Eq. (4)) for all terminal and non-terminal states. As such, the computational load saved by the counterfactual value baseline relative to the expected utility value baseline can depend on the depth of, and/or the number of non-terminal states in, the game tree or public tree that represents the environment or the IIG. The MCCFR with the counterfactual value baseline is even more computationally efficient than the one based on the expected utility value baseline if the game tree or public tree is deep and/or has a large number of non-terminal states. As an example, in HUNL, d=1326. The expected-utility-value-based method needs to conduct at least 1326×1326 add operations to update its baseline, while the counterfactual-value-based method only needs 1×1326 add operations.

In some embodiments, exploration techniques can be applied to MCCFR to achieve better performance with fewer samples. In some embodiments, a hybrid or mixture sampling policy can be used to balance exploitation and exploration, which is a trade-off in MCCFR when learning a state-action policy from sampling experience. In some embodiments, the hybrid sampling policy can be represented by:

$\begin{matrix}{{{\sigma_{i}^{se}\left( a \middle| I_{i} \right)} = {{\left( {1 - \alpha} \right)}{\sigma_{i}^{s}\left( a \middle| I_{i} \right)}} + {\alpha{\sigma_{i}^{e}\left( a \middle| I_{i} \right)}}},} & (20)\end{matrix}$

where σ_(i)^(s)(a|I_(i)) refers to a sampling policy and σ_(i)^(e)(a|I_(i)) refers to an exploration policy. α ∈ [0,1] refers to the mixture factor, which controls the weight of exploration. Typically, α is a decay factor. For example, set

${\alpha = \frac{1}{\ln \left( {t + {10}} \right)}},$

which ensures lim_(t→∞) α=0. The sampling policy σ_(i)^(s) can be any suitable sampling policy, including RCS σ_(i)^(rcs), MCS σ_(i)^(mcs), WAS σ_(i)^(was), outcome sampling, external sampling, robust sampling, or uniform sampling. In some embodiments, both σ_(i)^(s) and σ_(i)^(e) satisfy Σ_(a∈A(I_(i))) σ_(i)^(s)(a|I_(i))=1 and Σ_(a∈A(I_(i))) σ_(i)^(e)(a|I_(i))=1. Therefore, Σ_(a∈A(I_(i))) σ_(i)^(se)(a|I_(i))=1.
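
A minimal sketch (in Python) of the mixture in Eq. (20), using the example decay α = 1/ln(t+10) given above; the function name and the array representation are assumptions of this sketch.

    import math
    import numpy as np

    def hybrid_sampling_policy(sigma_s, sigma_e, t):
        # Eq. (20): mix a sampling policy with an exploration policy,
        # with mixture factor alpha = 1/ln(t + 10) decaying toward 0.
        alpha = 1.0 / math.log(t + 10)
        return (1.0 - alpha) * sigma_s + alpha * sigma_e

    sigma_s = np.array([0.7, 0.2, 0.1])    # e.g., a WCS sampling policy
    sigma_e = np.array([1/3, 1/3, 1/3])    # e.g., uniform exploration
    print(hybrid_sampling_policy(sigma_s, sigma_e, t=1))

Because both input policies sum to one, the mixture also sums to one, as noted above.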

Define ΔC^(t)(a|I_(i)) as the number of times action a is sampled at infoset I_(i) in iteration t. If the infoset I_(i) or the action a is not sampled in that iteration, ΔC^(t)(a|I_(i)) is 0. The cumulative sampling times can be computed by

$\begin{matrix}{{{C^{t}\left( a \middle| I_{i} \right)} = {\sum_{j = 1}^{t}{{\Delta C}^{j}\left( a \middle| I_{i} \right)}}},} & (21)\end{matrix}$

In value-form MCCFR, such as outcome sampling, if the action a is sampled at infoset I_(i) in iteration t, set ΔC^(t)(a|I_(i))=1. In vector-form MCCFR, when I⃗_(i) is sampled, ΔC^(t)(a|I_(i)) for each infoset I_(i) ∈ I⃗_(i) should be updated accordingly. In some embodiments, a single counter is used for the entire vector of information sets I⃗_(i) to count the number of times the action a is sampled. In some embodiments, when a mini-batch MCCFR (which is described in PCT App. No. PCT/CN2019/072200, filed on Jan. 17, 2019, entitled "SAMPLING SCHEMES FOR STRATEGY SEARCHING IN STRATEGIC INTERACTION BETWEEN PARTIES," and in U.S. application Ser. No. 16/448,390, filed on Jun. 21, 2019, entitled "SAMPLING SCHEMES FOR STRATEGY SEARCHING IN STRATEGIC INTERACTION BETWEEN PARTIES") is used, ΔC^(t)(a|I_(i)) can be larger than 1 because a mini-batch of blocks is sampled in one iteration. The exploration policy can be computed by

$\begin{matrix}{{{\sigma_{i}^{e,t}\left( a \middle| I_{i} \right)} = \frac{\left( {1 + \frac{\beta}{\sqrt{C^{t}\left( a \middle| I_{i} \right)}}} \right)}{\sum_{a \in {A{(I_{i})}}}\left( {1 + \frac{\beta}{\sqrt{C^{t}\left( a \middle| I_{i} \right)}}} \right)}},} & (22)\end{matrix}$

where σ_(i)^(e,t) refers to the exploration policy in iteration t and β is a nonnegative real number. If β=0, then σ_(i)^(e,t)(a|I_(i)) is a uniform random exploration. If β>0 and action a at I_(i) is sampled over and over again, σ_(i)^(e,t)(a|I_(i)) tends to become small, so that this action potentially has a smaller probability of being sampled than without exploration. Exploration is empirically helpful in MCCFR. For example, if the cumulative regret of an action is negative, its current strategy is zero, and the action will not be sampled in the next iterations. However, the action could have a larger overall regret than other actions after many iterations, and without exploration it would take a large number of iterations for MCCFR to change its negative regret to a positive value. When using exploration, MCCFR has a certain probability of sampling this action and explores it after some iterations.
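
The count-based exploration of Eqs. (21)-(22) can be sketched as follows (in Python). The small constant added under the square root, which avoids division by zero for never-sampled actions, is an assumption of this sketch rather than part of Eq. (22).

    import numpy as np

    BETA = 2.0     # nonnegative exploration weight beta in Eq. (22)
    EPS = 1e-6     # assumption: guards the sqrt for unseen actions

    def exploration_policy(counts):
        # Eq. (22): actions sampled less often get more exploration mass.
        scores = 1.0 + BETA / np.sqrt(counts + EPS)
        return scores / scores.sum()

    counts = np.array([120.0, 5.0, 0.0])    # C^t(a|I_i) from Eq. (21)
    sigma_e = exploration_policy(counts)    # favors the unseen third action
    counts[1] += 1.0                        # Delta C^t after sampling a=1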

Experiments have been carried out to evaluate the example techniques for accelerating the convergence of MCCFR on three different poker games: heads-up no-limit preflop hold'em poker (NLPH), heads-up no-limit flop hold'em poker (NLFH), and the river subgame of heads-up no-limit Texas hold'em poker (HUNL-R). The techniques include the vector-form sampling policies, the variance reduction techniques with the counterfactual value baseline, the hybrid sampling policy with exploration, and hybrid MCCFR variants with a skipping mechanism and discounting updates. The experimental results show that the described MCCFR variants obtain two or three orders of magnitude improvement.

HUNL is a primary benchmark for imperfect information game solving methods. The HUNL used in this experiment is the standard version in the Annual Computer Poker Competition. At the start of HUNL, the two players have 20000 chips. HUNL has at most four betting rounds if neither player folds in advance. The four betting rounds are named preflop, flop, turn, and river, respectively. At the start of each hand, both players are dealt two private cards from a 52-card deck. The player in the small blind position first puts 50 chips into the pot, and the player in the big blind position puts 100 chips into the pot. Their positions alternate after each hand. Each player can choose fold, call, or raise. If a player chooses fold, she loses the money in the pot and the hand is over. If a player chooses call, she places a number of chips into the pot so that her total chips equal the opponent's chips. If a player chooses raise, she adds more chips into the pot than the opponent does. After the preflop round, three public cards are revealed and the flop betting round occurs. After this round, another public card is dealt and the third betting round takes place. After that, the last public card is revealed, and the river round begins.

HUNL contains about 10¹⁶¹ infosets and is too large to traverse all the nodes. To reduce the computation, abstraction techniques such as action abstraction or card abstraction can be used to solve a subgame of the full HUNL in real time. This experiment uses bet sizes of 1× the pot and all-in in each betting round, without any card abstraction.

NLPH has only one betting round, and the value for a terminal node is represented by the expected game utility under uniformly random community cards, which is precomputed and saved on disk. NLPH contains 7.8×10⁴ infosets and 1.0×10⁹ states. NLFH is similar to HUNL except that there are only the first two betting rounds (preflop and flop) and three community cards. NLFH is a large game and contains more than 4.3×10⁹ infosets and 5.8×10¹² states. The HUNL-R used in our experiments refers to the fourth betting round of HUNL. At the start of the round, there is $100 in the pot for each player, and the ranges of both players are specified by a uniform random policy. HUNL-R contains 2.6×10⁴ infosets and 3.5×10⁷ states.

A set of ablation studies related to different sampling policies, exploration techniques, and variance reduction techniques with a counterfactual value baseline is conducted on NLPH. Different MCCFR methods are then compared on HUNL-R and the extremely large NLFH.

All the experiments were evaluated by exploitability, which is used as a standard win rate measure; the method with the lower exploitability is better, and a Nash equilibrium has zero exploitability. The unit of exploitability in this specification is milli-big-blinds per game (mbb/g), which denotes how many thousandths of a big blind one player wins on average per hand of poker. For the abstracted large games, the exploitability is computed on the abstracted game. In the experiments,

${\alpha = \frac{1}{\ln \left( {t + {10}} \right)}},$

β=ln(t+10), and γ=0.5. Other values can be used. The experiments follow the typical procedure of MCCFR, traversing the public tree or the game tree separately for each player. FIGS. 2-6 show examples of simulation results of multiple MCCFR variants in the experiments. The x-axis of each figure represents the number of iterations, and the y-axis represents the exploitability. Without loss of generality, robust sampling is used as an example sampling scheme for the different MCCFR variants on NLPH poker. One effective version of robust sampling is that the traverser samples one action according to the uniform random policy and the opponent samples one action according to her current strategy.

FIG. 2 is a log-log plot 200 illustrating convergence performances of several MCCFR variants applied to NLPH with different sampling policies in accordance with embodiments of this specification. MCCFR refers to the semi-vector-form MCCFR. MCCFR-RCS, MCCFR-MCS, MCCFR-WCS, and MCCFR-WAS refer to the vector-form MCCFR variants with the RCS, MCS, WCS, and WAS sampling policies, respectively. The results showed that MCCFR-RCS achieved convergence similar to the semi-vector-form MCCFR, because RCS randomly selects an infoset I_(i) from I⃗_(i) and samples one action according to the current strategy at I_(i); such random selection does not consider the importance of different infosets. Except for MCCFR-RCS, the other vector-form MCCFR variants achieve two or three orders of magnitude improvement over the semi-vector-form MCCFR. WCS and WAS, which weight each infoset by the range, have better long-term performance than MCS. Note that semi-vector-form MCCFR typically converges faster than its value-form version, so a convergence curve for value-form MCCFR is not presented in FIG. 2. In the remaining experiments, WCS weighted by both players' ranges is selected as the sampling policy.

FIG. 3 is a log-log plot 300 illustrating convergence performances of several MCCFR variants applied to NLPH with and without exploration techniques in accordance with embodiments of this specification. Specifically, the convergence curves 310, 320, 330, and 340 correspond to MCCFR, MCCFR-WCS without exploration, MCCFR-WCS with ε-greedy exploration, and MCCFR-WCS with the example exploration technique described w.r.t. Eq. (20), respectively. FIG. 3 shows that MCCFR-WCS outperforms MCCFR, and that MCCFR-WCS with ε-greedy exploration and MCCFR-WCS with the example exploration technique described w.r.t. Eq. (20) outperform MCCFR-WCS in terms of convergence performance. Moreover, MCCFR-WCS with the example exploration technique described w.r.t. Eq. (20) converges even faster than the one with ε-greedy exploration, because the former exploration technique takes into consideration the sampled frequencies of different actions.

FIG. 4A is a log-log plot 400 illustrating convergence performances of several MCCFR variants applied to NLPH with and without different variance reduction techniques in accordance with embodiments of this specification. Specifically, the convergence curves 410, 420, 430, 440, and 450 correspond to MCCFR; MCCFR-WCS without any variance reduction technique; MCCFR-WCS with a variance reduction technique using an expected utility value baseline (denoted as MCCFR-WCS(ev b)); MCCFR-WCS with a variance reduction technique using the CFV baseline described w.r.t. Eq. (19) (denoted as MCCFR-WCS(cfv b)); and MCCFR-WCS with a variance reduction technique using the CFV bootstrapping baseline (denoted as MCCFR-WCS(cfv b, boot)), respectively.

As shown in FIG. 4A, vector-form MCCFR variants converge faster when using a variance reduction technique (e.g., a control variate technique). Moreover, the variance reduction techniques using the CFV baseline (both MCCFR-WCS(cfv b) and MCCFR-WCS(cfv b, boot)) outperform the one with the expected utility value baseline, MCCFR-WCS(ev b). Furthermore, the MCCFR with the expected utility value baseline needs to conduct 1326×1326 add operations for each sampled node, which is much more time-consuming than the counterfactual value baseline. To make a fair comparison, a convergence comparison by running time is provided in FIG. 4B.

FIG. 4B is a log-log plot 405 illustrating example computational efficiencies of several MCCFR variants applied to NLPH with and without different variance reduction techniques in accordance with embodiments of this specification. In the experiment, the semi-vector-form MCCFR (denoted as MCCFR) costs 5.9 seconds per 1000 iterations; the vector-form MCCFR-WCS (denoted as MCCFR-WCS) costs 6.2 seconds; the methods with a counterfactual value baseline (either MCCFR-WCS(cfv b) or MCCFR-WCS(cfv b, boot)) cost 6.5 seconds; and the method with the expected utility value baseline (denoted as MCCFR-WCS(ev b)) costs 48.7 seconds.

Although the vector-form MCCFR samples more infosets than the semi-vector-form MCCFR in each iteration, the two cost similar computation time because of the benefit of matrix manipulation. Empirically, the method with the bootstrapping baseline (MCCFR-WCS(cfv b, boot)) converged slightly faster than the one using the CFV baseline described w.r.t. Eq. (19) (MCCFR-WCS(cfv b)). In the remaining experiments, the method with the bootstrapping counterfactual value baseline is selected as the default MCCFR variant.

FIGS. 5A-5C are log-log plots 500, 530, and 560 illustrating convergence performances of several MCCFR variants by external sampling on three different poker games, NLPH, HUNL-R, and NLFH, in accordance with embodiments of this specification. FIGS. 5A-5C show that MCCFR with the described WCS sampling policy and bootstrapping baseline can significantly improve the convergence of MCCFR in many settings, including the extremely large game NLFH. The improved MCCFR can benefit many poker AIs and help them achieve better strategies in less running time.

FIG. 6A is a log-log plot 600 illustrating convergence performances of several MCCFR variants with and without skipping on NLPH in accordance with embodiments of this specification. The experiments are performed by external sampling vector-form MCCFR. In CFR, the cumulative regret is initialized to zero and the current strategy starts from a uniform random strategy. In some embodiments, only the average strategy profile over all iterations is proven to converge to a Nash equilibrium. In some embodiments, skipping previous iterations of CFR can obtain faster convergence of MCCFR. FIG. 6A shows that the MCCFR variants with different skipping iterations significantly improve the performance on NLPH, and shows the long-term performance of the MCCFR algorithm on NLPH over a long iteration horizon. The method that skips the previous 10000 iterations (denoted as WCS(skip 10k)) converged to a 0.94-Nash equilibrium; this exploitability is considered sufficiently converged in Texas hold'em.

FIG. 6B is a log-log plot 650 illustrating convergence performances of MCCFR variants with and without discounting on NLPH in accordance with embodiments of this specification. The experiments are performed by external sampling vector-form MCCFR. As a discounting mechanism, a linear MCCFR weights the regrets and average strategies with a value dependent on the iteration t. In the experiment, this discounting mechanism is combined with the vector-form MCCFR with a specified weight of t^(w). FIG. 6B shows that the linear MCCFR with weight t^(w), where w=1 and w=2 (denoted as Linear WCS (w=1) and Linear WCS (w=2)), improves the convergence more than the vector-form MCCFR without discounting (denoted as WCS).

FIG. 7 is a flowchart of an example of a process 700 for performing Monte Carlo counterfactual regret minimization (MCCFR) for determining action selection policies for software applications in accordance with embodiments of this specification. The process 700 can be an example of the MCCFR algorithm with a sampling scheme described above.

The example process 700 shown in FIG. 7 can be modified or reconfigured to include additional, fewer, or different operations, which can be performed in the order shown or in a different order. In some instances, one or more of the operations can be repeated or iterated, for example, until a terminating condition is reached. In some implementations, one or more of the individual operations shown in FIG. 7 can be executed as multiple separate operations, or one or more subsets of the operations shown in FIG. 7 can be combined and executed as a single operation.

In some embodiments, the process 700 can be performed in an iterative manner, for example, by performing two or more iterations. In some embodiments, the process 700 can be used in automatic control, robotics, or any other applications that involve action selections. In some embodiments, the process 700 can be performed by an execution device for generating an action selection policy (e.g., a strategy) for completing a task (e.g., finding a Nash equilibrium) in an environment that includes the execution device and one or more other devices. In some embodiments, generating the action selection policy can include some or all operations of the process 700, for example, by initializing an action selection policy at 702 and updating the action selection policy at 750 over iterations. The execution device can perform the process 700 in the environment for controlling operations of the execution device according to the action selection policy.

In some embodiments, the execution device can include a data processing apparatus, such as a system of one or more computers, located in one or more locations and programmed appropriately in accordance with this specification. For example, a computer system 1000 of FIG. 10, appropriately programmed, can perform the process 700. The execution device can be associated with an execution party or player. The execution party or player and one or more other parties (e.g., associated with the one or more other devices) can be participants or players in an environment, for example, for strategy searching in strategic interaction between the execution party and the one or more other parties.

In some embodiments, the environment can be modeled by an imperfect information game (IIG) that involves two or more players. In some embodiments, the process 700 can be performed to solve an IIG, for example, by the execution party supported by the application. The IIG can represent one or more real-world scenarios, such as resource allocation, product/service recommendation, cyber-attack prediction and/or prevention, traffic routing, or fraud management, that involve two or more parties, where each party may have incomplete or imperfect information about the other party's decisions. As an example, the IIG can represent a collaborative product-service recommendation service that involves at least a first player and a second player. The first player may be, for example, an online retailer that has customer (or user) information, product and service information, purchase history of the customers, etc. The second player can be, for example, a social network platform that has social networking data of the customers, a bank or another financial institution that has financial information of the customers, a car dealership, or any other party that may have information of the customers on the customers' preferences, needs, financial situations, locations, etc., useful in predicting and recommending products and services to the customers. The first player and the second player may each have proprietary data that the player does not want to share with others. The second player may only provide partial information to the first player at different times. As such, the first player may only have limited access to the information of the second player. In some embodiments, the process 700 can be performed for making a recommendation to a party with the limited information of the second party, or for planning a route with limited information.

At 702, an action selection policy (e.g., a strategy σ_(i)^(t)) is initialized in a first iteration, i.e., the t=1 iteration. In some embodiments, an action selection policy can include or otherwise specify a respective probability (e.g., σ_(i)^(t)(a_(j)|I_(i))) of selecting an action (e.g., a_(j)) among a plurality of possible actions in a state (e.g., a current state i) of the execution device (e.g., the execution device that performs the process 700). The current state results from a previous action taken by the execution device in a previous state, and each action of the plurality of possible actions leads to a respective next state if performed by the execution device when the execution device is in the current state.

In some embodiments, a state can be represented by a node of the game tree (e.g., a non-terminal node 123, 127, 143 b, or 147 b, or a terminal node 143 a, 153 a, 153 b, 143 c, 143 d, 147 a, 157 a, 157 b, 147 c, or 147 d of the game tree 100). In some embodiments, the state can be a public state represented by a node of a public tree (e.g., a non-terminal node 125, 135 a, 135 b, or 145 b, or a terminal node 145 a, 145 c, 145 d, 155 a, or 155 b of the public tree 150).

In some embodiments, the strategy can be initialized, for example, based on an existing strategy, a uniform random strategy (e.g., a strategy based on a uniform probability distribution), or another strategy (e.g., a strategy based on a different probability distribution). For example, if the system warm starts from an existing CFR method (e.g., an original CFR or MCCFR method), the iterative strategy can be initialized from an existing strategy profile to clone existing regrets and strategy.

At 704, it is determined whether a convergence condition is met. MCCFR typically includes multiple iterations, and the convergence condition can be used for determining whether to continue or terminate the iteration. In some embodiments, the convergence condition can be based on the exploitability of a strategy σ. According to the definition of exploitability, exploitability should be larger than or equal to 0, and a smaller exploitability indicates a better strategy; that is, the exploitability of a converged strategy should approach 0 after enough iterations. For example, in poker, when the exploitability is less than 1 mbb/g, the time-average strategy is regarded as a good strategy, and it is determined that the convergence condition is met. In some embodiments, the convergence condition can be based on a predetermined number of iterations. For example, in a small game, the iterations can be easily determined by the exploitability: if the exploitability is small enough, the process 700 can terminate. In a large game, the exploitability is intractable, and a large iteration parameter typically can be specified; after each iteration, a new strategy profile can be obtained that is better than the old one. For example, in a large game, the process 700 can terminate after a sufficient number of iterations.

If the convergence condition is met, no further iteration is needed, and the process 700 proceeds to 706, where operations of the execution device are controlled according to each current action selection policy in the vector of current action selection policies. For example, each current action selection policy in the current iteration, or an average action selection policy across the t iterations, can be output as control commands to control one or more of a direction, speed, distance, or other operation of an engine, motor, valve, actuator, accelerator, brake, or other device in an autonomous vehicle or other applications. If the convergence condition is not met, t is increased by 1, and the process 700 proceeds to the next iteration, wherein t>1.
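
This iteration control can be sketched as follows (in Python); the exploitability function, the threshold, and the iteration budget are placeholders assumed for this sketch, since computing exploitability is game-specific.

    def run_mccfr(policy, one_iteration, exploitability=None,
                  threshold=1.0, max_iterations=100_000):
        # Iterate until a convergence condition is met: a small enough
        # exploitability (small games) or an iteration budget (large games).
        for t in range(1, max_iterations + 1):
            policy = one_iteration(policy, t)
            if exploitability is not None and exploitability(policy) < threshold:
                break
        return policy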

In a current iteration (e.g., the t-th iteration), at 710, a plurality of possible actions in a state of the execution device is identified. In some embodiments, as mentioned above, the state can be a public state represented by a node of a public tree (e.g., a non-terminal node 125, 135 a, 135 b, or 145 b, or a terminal node 145 a, 145 c, 145 d, 155 a, or 155 b of the public tree 150). The state can correspond to a vector of information sets, and each information set in the vector of information sets comprises a sequence of actions taken by the execution device that leads to the state. For example, as shown in FIG. 1B, the state represented by the node in the public tree 150 can maintain a vector of infosets I⃗_(i)=[I_(i1), I_(i2), I_(i3)].

In some embodiments, the state corresponds to a public sequence that comprises one or more actions publicly known by the execution device and the one or more other devices in a trajectory starting from an initial state (e.g., a root node of the public tree) and ending in the state. For example, the state of the node 155 b corresponds to a public sequence (e.g., the public sequence [A_(1a), A_(2b), A_(3b)]) that comprises one or more actions publicly known by the execution device (e.g., A_(1a) and A_(3b)) and the one or more other devices (e.g., A_(2b)) from the root node 125 to the node 155 b. Each information set in the vector of information sets comprises the public sequence. In some embodiments, each information set in the vector of information sets also comprises one or more non-public actions (e.g., taken by the execution device or by chance) along the trajectory starting from the initial state (e.g., the root node of the public tree) and ending in the state. For example, each information set in the vector of information sets at the state of the node 155 b comprises the public sequence [A_(1a), A_(2b), A_(3b)] and a respective non-public action (e.g., card J, Q, or K dealt by chance).

As shown in FIG. 1B, with a corresponding vector of information sets, the state represented by the node of the public tree that represents the environment is associated with or corresponds to a plurality of possible actions in the state. For example, the node 125 shown in the public tree 150 is associated with multiple actions (e.g., actions A_(1a) and A_(1b)) of the state that lead to respective next states (e.g., node 135 a and node 135 b). As another example, another state (e.g., node 145 b) of the execution device is associated with multiple actions (e.g., actions A_(3a) and A_(3b)) of the state that lead to respective next states (e.g., node 155 a and node 155 b), where the node 145 b results from a previous action A_(1b) taken by the execution device in a previous state (e.g., node 135 a).

In some embodiments, the plurality of possible actions in the state of the execution device is identified, for example, by reading a data structure representing the environment (e.g., a public tree of an IIG). The data structure can include a respective plurality of possible actions in each of the states of the environment.
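
As an illustration of such a data structure, the sketch below (in Python) defines a public-tree node carrying its possible actions and the vector of information sets; all field names are assumptions of this sketch rather than the specification's format.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class PublicNode:
        # One public state: its public sequence, infoset vector, and children.
        public_sequence: List[str]                    # actions publicly known
        infosets: List[str]                           # vector [I_i1, I_i2, ...]
        children: Dict[str, "PublicNode"] = field(default_factory=dict)

        @property
        def possible_actions(self) -> List[str]:
            return list(self.children.keys())

    leaf = PublicNode(public_sequence=["A1a", "A2b"], infosets=["J", "Q", "K"])
    root = PublicNode(public_sequence=[], infosets=["J", "Q", "K"],
                      children={"A1a": leaf})
    print(root.possible_actions)   # identified by reading the data structure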

At 720, a vector of current action selection policies in the state (e.g., a vector of current strategies σ⃗_(i)=[σ_(i1), σ_(i2), σ_(i3)]) is identified. In some embodiments, the vector of current action selection policies in the state is the action selection policy in the state in the current iteration t (the iteration index t is omitted for simplicity). In some embodiments, each current action selection policy in the vector of current action selection policies corresponds to an information set in the vector of information sets (e.g., the vector of infosets I⃗_(i)=[I_(i1), I_(i2), I_(i3)]). The action selection policy specifies a respective probability of selecting an action among the plurality of possible actions in the state. For example, the action selection policy σ_(i1) corresponds to I_(i1) in the vector of infosets I⃗_(i) in the state. If the state is the node 125 of the public tree 150 in FIG. 1B, the action selection policy σ_(i1) specifies a probability of selecting the action A_(1a) and a probability of selecting the action A_(1b) in the state under the action selection policy σ_(i1) in the current iteration t.

In some embodiments, the vector of current action selection policies in the state in the current iteration is identified by identifying an initial vector of current action selection policies in the state at 702, or by identifying an updated vector of current action selection policies in the state in a previous iteration, for example, according to 750.

At 730, a sampling policy is computed based on the vector of current action selection policies in the state, wherein the sampling policy specifies a respective sampling probability corresponding to each of the plurality of possible actions in the state. In some embodiments, the sampling policy comprises a probability distribution over the plurality of possible actions in the state.

Note that a sampling policy is different from an action selection policy (e.g., the current action selection policy), although both can be probability distributions over the plurality of possible actions in the state. The sampling policy is used in MCCFR to determine which trajectories or paths in an environment to sample in a Monte Carlo method, rather than traversing all possible trajectories or paths in the environment. The sampling policy is used to compute the probability of a sampled terminal trajectory (i.e., a sequence of actions), which is used for computing a sampled counterfactual value (also referred to as an estimated counterfactual value) to approximate an (actual) counterfactual value that would be computed by traversing all the possible trajectories or paths in the environment.

On the other hand, regardless of whether sampling is used or not, the action selection policy can be a strategy that specifies and/or comprises a respective probability (e.g., σ_(i)(a|h)) of selecting an action a among the plurality of possible actions in the state h under the strategy, for example, to complete the task and approach a Nash equilibrium. The action selection policy can be updated in each iteration of a CFR algorithm. In some embodiments, the output of the CFR algorithm can be the action selection policy (or an average action selection policy across multiple iterations) that specifies a respective probability of selecting an action among the plurality of possible actions in each state of the IIG (under the strategy of the best response, as described w.r.t. Eq. (2)) so that the player can approximate or achieve the Nash equilibrium.

In some embodiments, in MCCFR, once an action is sampled according to the sampling policy, the action selection policy can be updated based on one or more of a regret, a CFV, and other values calculated based on the sampled action. In some embodiments, a sampling policy is independent of the action selection policy, for example, in value-form MCCFR. In vector-form MCCFR, there are multiple action selection policies in a state (e.g., a vector of action selection policies corresponding to the vector of information sets). In some embodiments, the sampling policy can be independent of the action selection policies (e.g., according to a uniform or another specified distribution). In some embodiments, the sampling policy can be computed based on the multiple action selection policies that correspond to the vector of information sets.

In some embodiments, the sampling policy can be computed based on the vector of current action selection policies in the state, for example, according to Random Current Strategy (RCS), Mean Current Strategy (MCS), Weighted Current Strategy (WCS), Weighted Average Strategy (WAS), or any other method that relates the sampling policy to the multiple current action selection policies in the state.

In some embodiments, computing a sampling policy based on the vector of current action selection policies of the execution device in the state comprises computing the sampling probability corresponding to each of the plurality of possible actions in the state as a mean value of the current action selection policies of each of the plurality of possible actions in the state over the vector of information sets, for example, according to Eq. (13).

In some embodiments, computing a sampling policy based on the vector of current action selection policies of the execution device in the state comprises computing the sampling probability corresponding to each of the plurality of possible actions in the state based on the current action selection policies of each of the plurality of possible actions in the state and respective reach probabilities of the vector of information sets. In some embodiments, this comprises computing the sampling probability corresponding to each of the plurality of possible actions in the state based on a sum of the current action selection policies of each of the plurality of possible actions in the state weighted by the respective reach probabilities of the vector of information sets, for example, according to Eq. (14).

In some embodiments, computing a sampling policy based on the vector of current action selection policies of the execution device in the state comprises computing the sampling probability corresponding to each of the plurality of possible actions in the state based on average action selection policies of each of the plurality of possible actions in the state and respective reach probabilities of the vector of information sets. In some embodiments, this comprises computing the sampling probability corresponding to each of the plurality of possible actions in the state based on a sum of the average action selection policies of each of the plurality of possible actions in the state weighted by the respective reach probabilities of the vector of information sets, for example, according to Eq. (15).
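
Consistent with the descriptions of Eqs. (13)-(15) above, the three computations can be sketched as follows (in Python). The explicit renormalization and the array layout (one row of current or average policy per information set) are assumptions of this sketch.

    import numpy as np

    def mcs(current_policies):
        # Eq. (13)-style: mean of the current policies over the infoset
        # vector; current_policies has shape (num_infosets, num_actions).
        p = current_policies.mean(axis=0)
        return p / p.sum()

    def wcs(current_policies, reach_probs):
        # Eq. (14)-style: current policies weighted by each infoset's reach.
        p = reach_probs @ current_policies
        return p / p.sum()

    def was(average_policies, reach_probs):
        # Eq. (15)-style: average policies weighted by each infoset's reach.
        p = reach_probs @ average_policies
        return p / p.sum()

    policies = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])  # 3 infosets
    reach = np.array([0.5, 0.3, 0.2])                          # reach of each I_ij
    print(mcs(policies), wcs(policies, reach))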

At 740, an action among the plurality of possible actions in the state is sampled according to the sampling probability of the action specified in the sampling policy. For example, for player 1 at the node 135 a of the public tree 150 in FIG. 1B, the sampling policy can include a sampling probability of sampling the action A_(1a) among the two possible actions A_(1a) and A_(1b) in the state of the node 135 a (say, a probability of 0.3), and a sampling probability of sampling the action A_(1b) among the two possible actions A_(1a) and A_(1b) in the state of the node 135 a (say, a probability of 0.7). The action A_(1b) is thus sampled with a higher probability of 0.7 at the node 135 a than the action A_(1a). The sampled action A_(1b) can be used for updating the current action selection policy for the next iteration.

At 750, each current action selection policy in the vector of current action selection policies of the execution device in the state is updated based on the action (e.g., the sampled action A_(1b) in the above example). In some embodiments, updating each current action selection policy in the vector of current action selection policies of the execution device in the state based on the action comprises performing Monte Carlo counterfactual regret minimization (MCCFR) based on the action, for example, according to some or all of Eqs. (4)-(12). For example, updating each current action selection policy in the vector of current action selection policies of the execution device in the state based on the action comprises: calculating a probability of a sampled terminal sequence of actions based on the sampling probability of the action (e.g., q(z)=Σ_(j:z∈Q_(j)) q_(Q_(j))), the sampled terminal sequence of actions including the action and a terminal state for completing a task; calculating a sampled counterfactual value of the action based on the probability of the sampled terminal sequence of actions (e.g., according to Eq. (9)); calculating a regret value of the action based on the sampled counterfactual value of the action (e.g., according to some or all of Eqs. (10)-(12)); and updating each policy in the vector of current action selection policies of the execution device in the state based on the regret value of the action (e.g., according to regret matching based on Eq. (7) or regret matching+). In some embodiments, an average strategy σ̄_(i)^(t) after the current iteration can be computed, for example, according to Eq. (8).
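
The update sequence just described can be sketched as follows (in Python). The regret-matching step follows the standard positive-regret normalization; the variable names, and the reduction of the update to a single infoset, are assumptions of this sketch.

    import numpy as np

    def mccfr_update(strategy, regrets, v_hat):
        # Update one infoset from sampled counterfactual values. v_hat
        # holds the sampled CFV of each action; the regret of an action
        # is its CFV minus the infoset CFV (cf. Eqs. (10)-(12)), and the
        # next strategy comes from regret matching over positive regrets.
        v_infoset = float(np.dot(strategy, v_hat))
        regrets += v_hat - v_infoset
        positive = np.maximum(regrets, 0.0)
        if positive.sum() > 0:
            strategy = positive / positive.sum()   # regret matching
        else:
            strategy = np.full_like(strategy, 1.0 / len(strategy))
        return strategy, regrets

    strategy = np.array([0.5, 0.5])
    regrets = np.zeros(2)
    strategy, regrets = mccfr_update(strategy, regrets, np.array([1.0, -2.0]))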

After 750, the process 700 can go back to 704 to determine whether a convergence condition is met. In some embodiments, in response to determining that the convergence condition is met, operations of the execution device are controlled based on the action selection policy. In some embodiments, in response to determining that the convergence condition is met, an average action selection policy across all iterations (e.g., from the first iteration to the current iteration) in each state can be computed, for example, according to Eq. (8). In some embodiments, the average action selection policy can serve as an output of the process 700, for example, as the computed Nash equilibrium.

In some embodiments, the action selection policy can serve as an output of the software-implemented application to automatically control the execution device's action at each state, for example, by selecting the action that has the highest probability among the plurality of possible actions based on the action selection policy. As an example, the environment comprises a traffic routing environment, the execution device supported by the application comprises a computer-assisted vehicle, the action selection policy comprises a route selection policy for controlling directions of the computer-assisted vehicle, and controlling operations of the execution device according to the action selection policy comprises controlling directions of the computer-assisted vehicle according to the route selection policy. Controlling operations of the computer-assisted vehicle may include controlling one or more of a throttle, steering, braking, navigation, or engine mode to achieve the directions, speeds, and other parameters specified in the route selection policy, which is generated according to the process 700 to complete the task of, for example, reaching a desired destination in an environment that includes other computer-assisted vehicles sharing the roads.

FIG. 8 is a flowchart of an example of another process 800 for performing Monte Carlo counterfactual regret minimization (MCCFR) for determining action selection policies for software applications in accordance with embodiments of this specification. The process 800 can be an example of the MCCFR algorithm with a hybrid sampling scheme with exploration, as described above. Note that the process 800 can be applied in value-form, semi-vector-form, and vector-form MCCFR. In some embodiments, the process 800 can be combined with the process 700, for example, by replacing the sampling policy in the process 700 with the hybrid sampling policy in the process 800.

The example process 800 shown in FIG. 8 can be modified or reconfigured to include additional, fewer, or different operations, which can be performed in the order shown or in a different order. In some instances, one or more of the operations can be repeated or iterated, for example, until a terminating condition is reached. In some implementations, one or more of the individual operations shown in FIG. 8 can be executed as multiple separate operations, or one or more subsets of the operations shown in FIG. 8 can be combined and executed as a single operation.

In some embodiments, the process 800 can be performed in an iterative manner, for example, by performing two or more iterations. In some embodiments, the process 800 can be used in automatic control, robotics, or any other applications that involve action selections. In some embodiments, the process 800 can be performed by an execution device for generating an action selection policy (e.g., a strategy) for completing a task (e.g., finding a Nash equilibrium) in an environment that includes the execution device and one or more other devices. In some embodiments, generating the action selection policy can include some or all operations of the process 800, for example, by initializing an action selection policy at 802 and updating the action selection policy at 850 over iterations. The execution device can perform the process 800 in the environment for controlling operations of the execution device according to the action selection policy.

In some embodiments, the execution device can include a data processing apparatus, such as a system of one or more computers, located in one or more locations and programmed appropriately in accordance with this specification. For example, a computer system 1000 of FIG. 10, appropriately programmed, can perform the process 800. The execution device can be associated with an execution party or player. The execution party or player and one or more other parties (e.g., associated with the one or more other devices) can be participants or players in an environment, for example, for strategy searching in strategic interaction between the execution party and the one or more other parties.

In some embodiments, the environment can be modeled by an imperfect information game (IIG) that involves two or more players. In some embodiments, the process 800 can be performed for solving an IIG, for example, by the execution party supported by the application. The IIG can represent one or more real-world scenarios, such as resource allocation, product/service recommendation, cyber-attack prediction and/or prevention, traffic routing, or fraud management, that involve two or more parties, where each party may have incomplete or imperfect information about the other party's decisions. As an example, the IIG can represent a collaborative product-service recommendation service that involves at least a first player and a second player. The first player may be, for example, an online retailer that has customer (or user) information, product and service information, purchase history of the customers, etc. The second player can be, for example, a social network platform that has social networking data of the customers, a bank or another financial institution that has financial information of the customers, a car dealership, or any other party that may have information of the customers on the customers' preferences, needs, financial situations, locations, etc., useful in predicting and recommending products and services to the customers. The first player and the second player may each have proprietary data that the player does not want to share with others. The second player may only provide partial information to the first player at different times. As such, the first player may only have limited access to the information of the second player. In some embodiments, the process 800 can be performed for making a recommendation to a party with the limited information of the second party, or for planning a route with limited information.

At 802, similar to 702, an action selection policy (e.g., a strategy σ_(i)^(t)) is initialized in a first iteration, i.e., the t=1 iteration. In some embodiments, an action selection policy can include or otherwise specify a respective probability (e.g., σ_(i)^(t)(a_(j)|I_(i))) of selecting an action (e.g., a_(j)) among a plurality of possible actions in a state (e.g., a current state i) of the execution device (e.g., the execution device that performs the process 800). The current state results from a previous action taken by the execution device in a previous state, and each action of the plurality of possible actions leads to a respective next state if performed by the execution device when the execution device is in the current state.

In some embodiments, a state can be represented by a node of the game tree (e.g., a non-terminal node 123, 127, 143 b, or 147 b, or a terminal node 143 a, 153 a, 153 b, 143 c, 143 d, 147 a, 157 a, 157 b, 147 c, or 147 d of the game tree 100). In some embodiments, the state can be a public state represented by a node of a public tree (e.g., a non-terminal node 125, 135 a, 135 b, or 145 b, or a terminal node 145 a, 145 c, 145 d, 155 a, or 155 b of the public tree 150).

In some embodiments, the strategy can be initialized, for example, based on an existing strategy, a uniform random strategy (e.g., a strategy based on a uniform probability distribution), or another strategy (e.g., a strategy based on a different probability distribution). For example, if the system warm starts from an existing CFR method (e.g., an original CFR or MCCFR method), the iterative strategy can be initialized from an existing strategy profile to clone existing regrets and strategy.

At 804, similar to 704, it is determined whether a convergence condition is met. MCCFR typically includes multiple iterations, and the convergence condition can be used for determining whether to continue or terminate the iteration. In some embodiments, the convergence condition can be based on the exploitability of a strategy σ. According to the definition of exploitability, exploitability should be larger than or equal to 0, and a smaller exploitability indicates a better strategy; that is, the exploitability of a converged strategy should approach 0 after enough iterations. For example, in poker, when the exploitability is less than 1 mbb/g, the time-average strategy is regarded as a good strategy, and it is determined that the convergence condition is met. In some embodiments, the convergence condition can be based on a predetermined number of iterations. For example, in a small game, the iterations can be easily determined by the exploitability: if the exploitability is small enough, the process 800 can terminate. In a large game, the exploitability is intractable, and a large iteration parameter typically can be specified; after each iteration, a new strategy profile can be obtained that is better than the old one. For example, in a large game, the process 800 can terminate after a sufficient number of iterations.

If the convergence condition is met, no further iteration is needed. The process 800 proceeds to 806, and operations of the execution device are controlled according to the action selection policy. For example, the action selection policy in the current iteration, or an average action selection policy across the t iterations, can be output as control commands to control one or more of a direction, speed, distance, or other operation of an engine, motor, valve, actuator, accelerator, brake, or other device in an autonomous vehicle or other applications. If the convergence condition is not met, t is increased by 1, and the process 800 proceeds to the next iteration, wherein t>1.

In a current iteration (e.g., the t-th iteration), at 810, a sampling policy in a state of the execution device is identified. The sampling policy specifies a respective sampling probability of sampling each of the plurality of possible actions in the state. In some embodiments, the sampling policy comprises a first probability distribution over the plurality of possible actions in the state. The sampling policy can be any one of the sampling policies described with respect to FIG. 7. For example, the sampling policy can be one or more of a uniform sampling policy, a random sampling policy, a specified random policy, Random Current Strategy (RCS), Mean Current Strategy (MCS), Weighted Current Strategy (WCS), Weighted Average Strategy (WAS), or any other sampling policy. In some embodiments, the sampling policy in the state of the execution device is identified, for example, according to the example techniques described w.r.t. 730 of the process 700.

At 820, an exploration policy in the state of the execution device is identified. The exploration policy specifies a respective exploration probability corresponding to each of the plurality of possible actions in the state, wherein the exploration probability is negatively correlated with the number of times that each of the plurality of possible actions in the state has been sampled. In some embodiments, the exploration policy comprises a second probability distribution over the plurality of possible actions in the state.

In some embodiments, the exploration policy in the state of the execution device is identified, for example, by computing the exploration probability of each of the plurality of possible actions according to Eq. (22), wherein i represents an identifier of the execution device (e.g., associated with player i); I_(i) represents an information set of the state; A(I_(i)) represents the plurality of possible actions in the state; a represents one of the plurality of possible actions in the state; t represents the current iteration; C^(t)(a|I_(i)) represents the number of times that the action a has been sampled in the state up to the iteration t; σ_(i)^(e,t)(a|I_(i)) represents the exploration policy of exploring the action a at the state in iteration t; and β is a nonnegative real number.

At 830, a hybrid sampling policy is computed based on the sampling policy and the exploration policy. In some embodiments, computing the hybrid sampling policy based on the sampling policy and the exploration policy comprises computing a probability of each of the plurality of possible actions in the state based on a weighted sum of the sampling probability of each of the plurality of possible actions in the state and the exploration probability of each of the plurality of possible actions in the state. In some embodiments, computing the hybrid sampling policy based on the sampling policy and the exploration policy comprises computing a probability of each of the plurality of possible actions in the state according to Eq. (20), wherein I_(i) represents an information set of the state; a represents one of the plurality of possible actions in the state; σ_(i)^(se)(a|I_(i)) represents the hybrid sampling policy of sampling the action a in the state; σ_(i)^(s)(a|I_(i)) represents the sampling policy of sampling the action a in the state; σ_(i)^(e)(a|I_(i)) represents the exploration policy of exploring the action a in the state; and α ∈ [0,1] represents a factor that controls a weight of exploration.

At 840, an action among the plurality of possible actions in the state is sampled according to the sampling probability of the action specified in the hybrid sampling policy. For example, for player 1 at the node 135 a of the public tree 150 in FIG. 1B, the hybrid sampling policy can include a hybrid sampling probability of sampling the action A_(1a) among the two possible actions A_(1a) and A_(1b) in the state of the node 135 a (say, a probability of 0.2), and a hybrid sampling probability of sampling the action A_(1b) among the two possible actions A_(1a) and A_(1b) in the state of the node 135 a (say, a probability of 0.8). The action A_(1b) is thus sampled with a higher probability of 0.8 at the node 135 a than the action A_(1a). The sampled action A_(1b) can be used for updating an action selection policy for the next iteration.

At 842, in response to sampling the action out of the plurality of possible actions in the state according to the hybrid sampling policy, the number of times that the action has been sampled in the state is increased. In some embodiments, the number of times that the action has been sampled in the state comprises the number of times that the action has been sampled in the state up to the current iteration (e.g., C^(t)(a|I_(i))).

At 844, the exploration probability corresponding to the action out of the plurality of possible actions in the state is decreased for computing the hybrid sampling policy in the next iteration (e.g., the (t+1)-th iteration), so that the action has a lower probability of being sampled in the next iteration. In some embodiments, the exploration probability in the state of the execution device decreases with the number of times that the action has been sampled in the state, for example, according to Eq. (22) or another function.

At 850, an action selection policy of the execution device in the state is updated based on the action (e.g., the sampled action A_(1b) in the example described in 840) by performing Monte Carlo counterfactual regret minimization (MCCFR) based on the action. The action selection policy specifies a respective probability of selecting an action among the plurality of possible actions in the state for completing the task in the environment. The action selection policy can be, for example, a current strategy of the execution device in the state. For example, in MCCFR, once an action is sampled according to the hybrid sampling policy, the action selection policy can be updated based on one or more of a regret, a CFV, and other values calculated based on the sampled action.

In some embodiments, updating the action selection policy of the execution device in the state based on the action comprises performing the MCCFR based on the action, for example, according to some or all of Eqs. (4)-(12). For example, performing Monte Carlo counterfactual regret minimization (MCCFR) based on the action comprises: calculating a probability of a sampled terminal sequence of actions (e.g., q(z)=Σ_(j:z∈Q_(j)) q_(Q_(j))) based on a hybrid sampling probability of the action, the sampled terminal sequence of actions including the action and a terminal state for completing a task; calculating a sampled counterfactual value of the action based on the probability of the sampled terminal sequence of actions (e.g., according to Eq. (9)); calculating a regret value of the action based on the sampled counterfactual value of the action (e.g., according to some or all of Eqs. (10)-(12)); and updating the action selection policy of the execution device in the state based on the regret value of the action (e.g., according to regret matching based on Eq. (8) or regret matching+). In some embodiments, an average strategy σ̄_(i) ^(t) after the current iteration can be computed, for example, according to Eq. (8).
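
The following sketch shows the shape of that update under the assumption of outcome-sampling MCCFR, with q(z) used as an importance weight on the sampled counterfactual value; all names here are illustrative, not the specification's.

```python
def sampled_cfv(payoff, opp_reach, q_z):
    # Eq. (9)-style estimate: the terminal payoff, weighted by the other
    # players' reach probability and divided by q(z), the probability that
    # the sampling scheme produced this terminal sequence.
    return payoff * opp_reach / q_z

def regret_update(cum_regret, action_cfvs, policy):
    # The sampled state value is the policy-weighted sum of action CFVs;
    # each action's instantaneous regret is its CFV minus that value
    # (in the spirit of Eqs. (10)-(12)).
    state_value = sum(policy[a] * v for a, v in action_cfvs.items())
    for a, v in action_cfvs.items():
        cum_regret[a] = cum_regret.get(a, 0.0) + (v - state_value)
    return cum_regret
```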

After 850, the process 800 can go back to 804 to determine whether a convergence condition is met. In some embodiments, in response to determining that the convergence condition is met, operations of the execution device are controlled based on the action selection policy. In some embodiments, in response to determining that the convergence condition is met, an average action selection policy across all iterations (e.g., from the first iteration to the current iteration) in each state can be computed, for example, according to Eq. (8). In some embodiments, the average action selection policy can serve as an output of the process 800, for example, as the computed Nash equilibrium.

In some embodiments, the action selection policy can serve as an output of the software-implemented application to automatically control the execution device's action at each state, for example, by selecting the action that has the highest probability among a plurality of possible actions based on the action selection policy. As an example, the environment comprises a traffic routing environment, the execution device supported by the application comprises a computer-assisted vehicle, the action selection policy comprises a route selection policy for controlling directions of the computer-assisted vehicle, and controlling operations of the execution device according to the action selection policy comprises controlling directions of the computer-assisted vehicle according to the route selection policy.

FIG. 9 is a flowchart of an example of another process 900 for performing Monte Carlo counterfactual regret minimization (MCCFR) for determining action selection policies for software applications in accordance with embodiments of this specification. The process 900 can be an example of the MCCFR algorithm with variance reduction using a CFV baseline as described above. Note that the process 900 can be applied in value-form, semi-vector-form, and vector-form MCCFR. In some embodiments, the process 900 can be combined with the process 700 and/or the process 800 to further improve convergence performance of the MCCFR.

The example process 900 shown in FIG. 9 can be modified or reconfigured to include additional, fewer, or different operations, which can be performed in the order shown or in a different order. In some instances, one or more of the operations can be repeated or iterated, for example, until a terminating condition is reached. In some implementations, one or more of the individual operations shown in FIG. 9 can be executed as multiple separate operations, or one or more subsets of the operations shown in FIG. 9 can be combined and executed as a single operation.

In some embodiments, the process 900 can be performed in an iterative manner, for example, by performing two or more iterations. In some embodiments, the process 900 can be used in automatic control, robotics, or any other applications that involve action selections. In some embodiments, the process 900 can be performed by an execution device for generating an action selection policy (e.g., a strategy) for completing a task (e.g., finding Nash equilibrium) in an environment that includes the execution device and one or more other devices. In some embodiments, generating the action selection policy can include some or all operations of the process 900, for example, by initializing an action selection policy at 902 and updating the action selection policy at 916 over iterations. The execution device can perform the process 900 in the environment for controlling operations of the execution device according to the action selection policy.

In some embodiments, the execution device can include a data processing apparatus such as a system of one or more computers, located in one or more locations, and programmed appropriately in accordance with this specification. For example, a computer system 1000 of FIG. 10, appropriately programmed, can perform the process 900. The execution device can be associated with an execution party or player. The execution party or player and one or more other parties (e.g., associated with the one or more other devices) can be participants or players in an environment, for example, for strategy searching in strategic interaction between the execution party and the one or more other parties.

In some embodiments, the environment can be modeled by an imperfect information game (IIG) that involves two or more players. In some embodiments, the process 900 can be performed for solving an IIG, for example, by the execution party supported by the application. The IIG can represent one or more real-world scenarios such as resource allocation, product/service recommendation, cyber-attack prediction and/or prevention, traffic routing, fraud management, etc., that involve two or more parties, where each party may have incomplete or imperfect information about the other party's decisions. As an example, the IIG can represent a collaborative product-service recommendation service that involves at least a first player and a second player. The first player may be, for example, an online retailer that has customer (or user) information, product and service information, purchase history of the customers, etc. The second player can be, for example, a social network platform that has social networking data of the customers, a bank or another financial institution that has financial information of the customers, a car dealership, or any other party that may have information on the customers' preferences, needs, financial situations, locations, etc. that is useful in predicting and recommending products and services to the customers. The first player and the second player may each have proprietary data that the player does not want to share with others. The second player may only provide partial information to the first player at different times. As such, the first player may only have limited access to the information of the second player. In some embodiments, the process 900 can be performed for making a recommendation to a party with the limited information of the second party, or for planning a route with limited information.

At 902, similar to 702, an action selection policy (e.g., a strategy σ_(i) ^(t)) in a first iteration, i.e., the t=1 iteration, is initialized. In some embodiments, an action selection policy can include or otherwise specify a respective probability (e.g., σ_(i) ^(t)(a_(j)|I_(i))) of selecting an action (e.g., a_(j)) among a plurality of possible actions in a state (e.g., a current state i) of the execution device (e.g., the execution device that performs the process 900). The current state results from a previous action taken by the execution device in a previous state, and each action of the plurality of possible actions leads to a respective next state if performed by the execution device when the execution device is in the current state.

In some embodiments, a state can be represented by a node of the game tree (e.g., a non-terminal node 123, 127, 143 b, or 147 b or a terminal node 143 a, 153 a, 153 b, 143 c, 143 d, 147 a, 157 a, 157 b, 147 c, or 147 d of the game tree 100). In some embodiments, the state can be a public state represented by a node of a public tree (e.g., a non-terminal node 125, 135 a, 135 b, or 145 b, or a terminal node 145 a, 145 c, 145 d, 155 a, or 155 b of the public tree 150).

In some embodiments, the strategy can be initialized, for example, based on an existing strategy, a uniform random strategy (e.g., a strategy based on a uniform probability distribution), or another strategy (e.g., a strategy based on a different probability distribution). For example, if the system warm starts from an existing CFR method (e.g., an original CFR or MCCFR method), the iterative strategy can be initialized from an existing strategy profile to clone existing regrets and strategy.

At 904, similar to 704, whether a convergence condition is met is determined. MCCFR typically includes multiple iterations. The convergence condition can be used for determining whether to continue or terminate the iteration. In some embodiments, the convergence condition can be based on exploitability of a strategy σ. According to the definition of exploitability, exploitability should be larger than or equal to 0. A smaller exploitability indicates a better strategy. That is, the exploitability of a converged strategy should approach 0 after enough iterations. For example, in poker, when the exploitability is less than 1, the time-averaged strategy is regarded as a good strategy, and it is determined that the convergence condition is met. In some embodiments, the convergence condition can be based on a predetermined number of iterations. For example, in a small game, the iterations can be easily determined by the exploitability. That is, if the exploitability is small enough, the process 900 can terminate. In a large game, the exploitability is intractable, and typically a large number of iterations can be specified. After each iteration, a new strategy profile can be obtained, which is better than the old one. For example, in a large game, the process 900 can terminate after a sufficient number of iterations.
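
A minimal sketch of such a convergence test, assuming an exploitability oracle is available for small games and falling back to a fixed iteration budget for large ones (the thresholds are illustrative only):

```python
def convergence_met(iteration, exploitability=None,
                    eps=0.01, max_iterations=1_000_000):
    # Small games: stop when exploitability is close enough to zero.
    if exploitability is not None and exploitability < eps:
        return True
    # Large games: exploitability is intractable, so rely on a fixed budget.
    return iteration >= max_iterations
```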

If the convergence condition is met, no further iteration is needed. The process 900 proceeds to 922, and operations of the execution device are controlled according to the action selection policy. In some embodiments, the action selection policy comprises an action selection policy of the execution device in the non-terminal state. In some embodiments, controlling operations of the execution device according to the action selection policy comprises controlling operations of the execution device in the non-terminal state based on the action selection policy in the non-terminal state for the next iteration. In some embodiments, the action selection policy in the current iteration, or an average action selection policy across the t iterations, can be output as control commands to control one or more of a direction, speed, distance, or other operation of an engine, motor, valve, actuator, accelerator, brake, or other device in an autonomous vehicle or other applications. If the convergence condition is not met, t is increased by 1, and the process 900 proceeds to a next iteration, wherein t>1.

In some embodiments, each iteration of the process 900 can include a bottom-up process for computing CFVs and updating action selection policies of different states. For example, the process 900 can start from terminal states (e.g., the leaf node or terminal node 143 a, 153 a, 153 b, 143 c, 143 d, 147 a, 157 a, 157 b, 147 c, or 147 d of the game tree 100 in FIG. 1A, or the terminal node 145 a, 145 c, 145 d, 155 a, or 155 b of the public tree 150 in FIG. 1B) and move up to the initial state (e.g., the root node 110 of the game tree 100 in FIG. 1A or the root node 125 of the public tree 150 in FIG. 1B).

In a current iteration (e.g., the t-th iteration), at 905, a counterfactual value (CFV) (e.g., {tilde over (v)}_(i) ^(σ) ^(t) (I_(i), a|Q_(j))) of the execution device in a terminal state of completing a task is computed based on a payoff of the execution device at the terminal state and a reach probability of the one or more other devices reaching the terminal state, for example, according to the upper (or first) equation of Eq. (18).

The terminal state (e.g., the terminal node 155 b in FIG. 1B) results from a sequence of actions (e.g., a sequence of actions [A_(1a), A_(2b), A_(3b)]) that includes actions taken at a plurality of non-terminal states (e.g., the non-terminal nodes 125, 135 a, and 145 b) by the execution device (e.g., A_(1a) and A_(3b)) and by the one or more other devices (e.g., A_(2b)). In some embodiments, each of the plurality of non-terminal states has one or more child states. For example, the non-terminal node 125 has two child states, nodes 135 a and 135 b; the non-terminal node 135 a has two child states, nodes 145 a and 145 b; and the non-terminal node 145 b has two child states, nodes 155 a and 155 b.

In some embodiments, the reach probability of the one or more other devices reaching the terminal state comprises a product of probabilities of actions taken by the one or more other devices that reach the terminal state. For example, if the execution device corresponds to player 1, the reach probability of the one or more other devices (e.g., corresponding to player 2) reaching the terminal state (e.g., the terminal node 155 b) comprises a product of probabilities of actions (e.g., A_(2b)) taken by the one or more other devices that reach the terminal state. If the execution device corresponds to player 2, the reach probability of the one or more other devices (e.g., corresponding to player 1) reaching the terminal state (e.g., the terminal node 155 b) comprises a product of probabilities of actions (e.g., A_(1a) and A_(3b)) taken by the one or more other devices that reach the terminal state.
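
In code form, the terminal-state computation at 905 reduces to a single multiplication, as sketched below under the assumption that the opponents' action probabilities along the path are available as a list (the names are illustrative):

```python
from math import prod

def terminal_cfv(payoff_i, opponent_action_probs):
    # Upper (first) equation of Eq. (18): the CFV of player i at a terminal
    # state is i's payoff there, weighted by the probability that the other
    # players' actions reach that terminal state (their reach probability).
    return payoff_i * prod(opponent_action_probs)

# Example for player 1 at terminal node 155 b: the only opponent action on
# the path is A_2b, so the opponents' reach probability is P(A_2b):
# cfv = terminal_cfv(payoff_at_155b, [p_A2b])
```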

At 906, a baseline-corrected CFV (e.g., {circumflex over (v)}_(i) ^(σ) ^(t) (I_(i), a|Q_(j))) of the execution device in the terminal state is computed based on the CFV of the execution device in the terminal state, a CFV baseline of the execution device in the terminal state of a previous iteration, or both, for example, according to Eq. (17). For example, a sampled CFV baseline of the execution device (e.g., {tilde over (b)}_(i) ^(t−1)(a|I_(i))) that takes the action in the terminal state of the previous iteration is computed based on the CFV baseline of the execution device in the terminal state of the previous iteration, a sampling policy of the execution device that takes the action in the terminal state of the previous iteration, and a probability of reaching the terminal state that results from a sequence of actions taken by the execution device, for example, according to Eq. (16). In response to determining that the action is sampled, a baseline-corrected CFV of the execution device (e.g., {circumflex over (v)}_(i) ^(σ) ^(t) (I_(i), a|Q_(j))) that takes the action in the terminal state is computed based on the CFV of the execution device in the terminal state and the sampled CFV baseline of the execution device that takes the action in the terminal state of the previous iteration, for example, according to the lower (or second) equation of Eq. (17). In response to determining that the action is not sampled, the sampled CFV baseline of the execution device that takes the action in the terminal state of the previous iteration is used as the baseline-corrected CFV of the execution device in the terminal state, for example, according to the upper (or first) equation of Eq. (17).
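
The specification's Eqs. (16)-(17) are not reproduced here, so the sketch below uses a common control-variate form of baseline correction from variance-reduced MCCFR, which is our assumption about their shape: a sampled action receives the baseline plus the importance-weighted residual, and an unsampled action keeps the baseline alone, which leaves the estimator unbiased while shrinking its variance.

```python
def baseline_corrected_cfv(cfv, prev_baseline, sample_prob, was_sampled):
    # Assumed form of Eq. (17): for a sampled action, correct the CFV
    # estimate by the importance-weighted residual against the baseline.
    if was_sampled:
        return prev_baseline + (cfv - prev_baseline) / sample_prob
    # Unsampled actions fall back to the baseline alone; in expectation the
    # two branches combine to an unbiased, lower-variance estimate.
    return prev_baseline
```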

In some embodiments, for each of the non-terminal states and starting from a non-terminal state that has the terminal state and one or more other terminal states as child states, at 908, a CFV of the execution device in the non-terminal state (e.g., an estimated counterfactual value {tilde over (v)}_(i) ^(σ) ^(t) (I_(i), a|Q_(j))) is computed based on a weighted sum of the baseline-corrected CFVs of the execution device in the child states of the non-terminal state. In some embodiments, the weighted sum of the baseline-corrected CFV of the execution device in the terminal state and corresponding baseline-corrected CFVs of the execution device in the one or more other terminal states is computed based on the baseline-corrected CFV of the execution device in the terminal state and corresponding baseline-corrected CFVs of the execution device in the one or more other terminal states weighted by an action selection policy in the non-terminal state in the current iteration, for example, according to the lower (or second) equation of Eq. (18).

At 910, a baseline-corrected CFV (e.g., {circumflex over (v)}_(i) ^(σ) ^(t) (I_(i), a|Q_(j))) of the execution device in the non-terminal state is computed based on the CFV of the execution device in the non-terminal state, a CFV baseline of the execution device in the non-terminal state of a previous iteration, or both, for example, according to Eq. (17), similar to the techniques described w.r.t. 906.

At 912, a CFV baseline (e.g., b_(i) ^(t)(a|I_(i))) of the execution device in the non-terminal state of the current iteration is computed based on a weighted sum of the CFV baseline of the execution device in the non-terminal state of the previous iteration and the CFV (e.g., {tilde over (v)}_(i) ^(σ) ^(t) (I_(i), a|Q_(j))) or the baseline-corrected CFV (e.g., {circumflex over (v)}_(i) ^(σ) ^(t) (I_(i), a|Q_(j))) of the execution device in the non-terminal state, for example, according to Eq. (19). In some embodiments, the weighted sum of the CFV baseline of the execution device in the non-terminal state of the previous iteration and the CFV or the baseline-corrected CFV of the execution device in the non-terminal state comprises a sum of: the CFV baseline of the execution device in the non-terminal state of the previous iteration weighted by a scalar (e.g., 1−γ); and the CFV or the baseline-corrected CFV of the execution device in the non-terminal state weighted by a second scalar (e.g., γ) and a probability of considering the non-terminal state (e.g., q(I_(i))), for example, according to the lower (or second) equation of Eq. (19).
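
Steps 908 and 912 can be sketched as follows, assuming per-action baseline-corrected CFVs for the child states and a decay parameter gamma; the dictionary shapes and the name q_state are our assumptions.

```python
def nonterminal_cfv(policy, child_corrected_cfvs):
    # Lower (second) equation of Eq. (18): the CFV of a non-terminal state
    # is its children's baseline-corrected CFVs weighted by the current
    # action selection policy.
    return sum(policy[a] * v for a, v in child_corrected_cfvs.items())

def update_cfv_baseline(prev_baseline, cfv, q_state, gamma=0.5):
    # Eq. (19): a decaying average of CFV estimates; the new estimate is
    # additionally weighted by q(I_i), the probability of considering
    # this state under the sampling scheme.
    return (1.0 - gamma) * prev_baseline + gamma * q_state * cfv
```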

At 916, an action selection policy in the non-terminal state for the next iteration is determined based on the baseline-corrected CFV of the execution device in the non-terminal state of the current iteration. In some embodiments, the baseline-corrected CFV of each node can be used to compute the regret, cumulative regret, current strategy, and average strategy, for example, according to Eqs. (10), (12), (7), and (8), respectively. In some embodiments, determining an action selection policy in the non-terminal state for the next iteration based on the baseline-corrected CFV of the execution device in the non-terminal state of the current iteration comprises: calculating a regret value based on the baseline-corrected CFV of the execution device in the non-terminal state of the current iteration (e.g., according to some or all of Eqs. (10)-(12)); and determining an action selection policy in the non-terminal state for the next iteration based on the regret value according to regret matching (e.g., according to regret matching based on Eq. (8) or regret matching+). In some embodiments, an average strategy σ̄_(i) ^(t) after the current iteration can be computed, for example, according to Eq. (8).
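
For step 916, the sketch below turns cumulative regrets into the next iteration's policy via regret matching, with a regret matching+ variant that floors cumulative regrets at zero. This is a standard construction; the function names are ours.

```python
def regret_matching(cum_regret):
    # Play each action in proportion to its positive cumulative regret;
    # fall back to uniform when no action has positive regret.
    pos = {a: max(r, 0.0) for a, r in cum_regret.items()}
    z = sum(pos.values())
    n = len(pos)
    return {a: (p / z if z > 0 else 1.0 / n) for a, p in pos.items()}

def regret_matching_plus(cum_regret, inst_regret):
    # Regret matching+: clip cumulative regret at zero while accumulating,
    # so long-ago negative regret cannot suppress an action indefinitely.
    for a, r in inst_regret.items():
        cum_regret[a] = max(cum_regret.get(a, 0.0) + r, 0.0)
    return regret_matching(cum_regret)
```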

At 918, whether the current state is the initial state is determined. In some embodiments, such a determination can be used for determining whether to continue or terminate updating the baseline-corrected CFVs of the states in the current iteration. If the current state is the initial state, no further updating of the baseline-corrected CFV is needed, and the process 900 proceeds to the next iteration at 904. If the current state is not the initial state, a previous or parent state of the state (e.g., a parent node of the current node in a game tree or public tree) is used to replace the current state, and the process 900 goes back to 908 to obtain a CFV for each action of the previous state. The process 900 can continue as shown in FIG. 9.

In some embodiments, as described above, for each iteration of the process 900, only the terminal states would require computing the counterfactual value (e.g., {tilde over (v)}_(i) ^(σ) ^(t) (I_(i), a|Q_(j))) based on a multiplication of a payoff of the execution device at the terminal state and a reach probability of the one or more other devices reaching the terminal state, for example, according to the upper (or first) equation of Eq. (18). For non-terminal states, the counterfactual values and/or baseline-corrected counterfactual values can be computed based on weighted sums of the counterfactual values and/or baseline-corrected counterfactual values of their child states, because the baselines are based on the counterfactual values rather than the expected utility values. As such, compared to variance-reduction techniques using expected utility value baselines, which compute the counterfactual value based on a utility value matrix of player i and the opponent's range matrix (i.e., the reach probability of the opponent), for example, according to Eq. (4), the process 900 can reduce the computational load and improve the computational efficiency. In some embodiments, the computational load saved by the counterfactual value baseline relative to the expected utility value baseline can depend on a depth of, and/or a number of non-terminal states in, the game tree or public tree that represents the environment or the IIG.

FIG. 10 is a block diagram illustrating an example of a computer-implemented System 1000 used to provide computational functionalities associated with described algorithms, methods, functions, processes, flows, and procedures in accordance with embodiments of this specification. In the illustrated embodiment, System 1000 includes a Computer 1002 and a Network 1030.

The illustrated Computer 1002 is intended to encompass any computing device such as a server, desktop computer, laptop/notebook computer, wireless data port, smart phone, personal data assistant (PDA), tablet computer, one or more processors within these devices, another computing device, or a combination of computing devices, including physical or virtual instances of the computing device, or a combination of physical or virtual instances of the computing device. Additionally, the Computer 1002 can include an input device, such as a keypad, keyboard, touch screen, another input device, or a combination of input devices that can accept user information, and an output device that conveys information associated with the operation of the Computer 1002, including digital data, visual, audio, another type of information, or a combination of types of information, on a graphical-type user interface (UI) (or GUI) or other UI.

The Computer 1002 can serve in a role in a distributed computing system as a client, network component, a server, a database or another persistency, another role, or a combination of roles for performing the subject matter described in the present disclosure. The illustrated Computer 1002 is communicably coupled with a Network 1030. In some embodiments, one or more components of the Computer 1002 can be configured to operate within an environment, including cloud-computing-based, local, global, another environment, or a combination of environments.

At a high level, the Computer 1002 is an electronic computing device operable to receive, transmit, process, store, or manage data and information associated with the described subject matter. According to some embodiments, the Computer 1002 can also include or be communicably coupled with a server, including an application server, e-mail server, web server, caching server, streaming data server, another server, or a combination of servers.

The Computer 1002 can receive requests over Network 1030 (for example, from a client software application executing on another Computer 1002) and respond to the received requests by processing the received requests using a software application or a combination of software applications. In addition, requests can also be sent to the Computer 1002 from internal users (for example, from a command console or by another internal access method), external or third parties, or other entities, individuals, systems, or computers.

Each of the components of the Computer 1002 can communicate using a System Bus 1003. In some embodiments, any or all of the components of the Computer 1002, including hardware, software, or a combination of hardware and software, can interface over the System Bus 1003 using an application programming interface (API) 1012, a Service Layer 1013, or a combination of the API 1012 and Service Layer 1013. The API 1012 can include specifications for routines, data structures, and object classes. The API 1012 can be either computer-language independent or dependent and refer to a complete interface, a single function, or even a set of APIs. The Service Layer 1013 provides software services to the Computer 1002 or other components (whether illustrated or not) that are communicably coupled to the Computer 1002. The functionality of the Computer 1002 can be accessible for all service consumers using the Service Layer 1013. Software services, such as those provided by the Service Layer 1013, provide reusable, defined functionalities through a defined interface. For example, the interface can be software written in JAVA, C++, another computing language, or a combination of computing languages providing data in extensible markup language (XML) format, another format, or a combination of formats. While illustrated as an integrated component of the Computer 1002, alternative embodiments can illustrate the API 1012 or the Service Layer 1013 as stand-alone components in relation to other components of the Computer 1002 or other components (whether illustrated or not) that are communicably coupled to the Computer 1002. Moreover, any or all parts of the API 1012 or the Service Layer 1013 can be implemented as a child or a sub-module of another software module, enterprise application, or hardware module without departing from the scope of the present disclosure.

The Computer 1002 includes an Interface 1004. Although illustrated as a single Interface 1004, two or more Interfaces 1004 can be used according to particular needs, desires, or particular embodiments of the Computer 1002. The Interface 1004 is used by the Computer 1002 for communicating with another computing system (whether illustrated or not) that is communicatively linked to the Network 1030 in a distributed environment. Generally, the Interface 1004 is operable to communicate with the Network 1030 and includes logic encoded in software, hardware, or a combination of software and hardware. More specifically, the Interface 1004 can include software supporting one or more communication protocols associated with communications such that the Network 1030 or the hardware of the Interface 1004 is operable to communicate physical signals within and outside of the illustrated Computer 1002.

The Computer 1002 includes a Processor 1005. Although illustrated as a single Processor 1005, two or more Processors 1005 can be used according to particular needs, desires, or particular embodiments of the Computer 1002. Generally, the Processor 1005 executes instructions and manipulates data to perform the operations of the Computer 1002 and any algorithms, methods, functions, processes, flows, and procedures as described in the present disclosure.

The Computer 1002 also includes a Database 1006 that can hold data for the Computer 1002, another component communicatively linked to the Network 1030 (whether illustrated or not), or a combination of the Computer 1002 and another component. For example, Database 1006 can be an in-memory, conventional, or another type of database storing data consistent with the present disclosure. In some embodiments, Database 1006 can be a combination of two or more different database types (for example, a hybrid in-memory and conventional database) according to particular needs, desires, or particular embodiments of the Computer 1002 and the described functionality. Although illustrated as a single Database 1006, two or more databases of similar or differing types can be used according to particular needs, desires, or particular embodiments of the Computer 1002 and the described functionality. While Database 1006 is illustrated as an integral component of the Computer 1002, in alternative embodiments, Database 1006 can be external to the Computer 1002. As an example, Database 1006 can include the above-described action selection policies (strategies) 1026, for example, for computing an accumulative and/or average action selection policy (strategy).

The Computer 1002 also includes a Memory 1007 that can hold data for the Computer 1002, another component or components communicatively linked to the Network 1030 (whether illustrated or not), or a combination of the Computer 1002 and another component. Memory 1007 can store any data consistent with the present disclosure. In some embodiments, Memory 1007 can be a combination of two or more different types of memory (for example, a combination of semiconductor and magnetic storage) according to particular needs, desires, or particular embodiments of the Computer 1002 and the described functionality. Although illustrated as a single Memory 1007, two or more Memories 1007 of similar or differing types can be used according to particular needs, desires, or particular embodiments of the Computer 1002 and the described functionality. While Memory 1007 is illustrated as an integral component of the Computer 1002, in alternative embodiments, Memory 1007 can be external to the Computer 1002.

The Application 1008 is an algorithmic software engine providing functionality according to particular needs, desires, or particular embodiments of the Computer 1002, particularly with respect to functionality described in the present disclosure. For example, Application 1008 can serve as one or more components, modules, or applications. Further, although illustrated as a single Application 1008, the Application 1008 can be implemented as multiple Applications 1008 on the Computer 1002. In addition, although illustrated as integral to the Computer 1002, in alternative embodiments, the Application 1008 can be external to the Computer 1002.

The Computer 1002 can also include a Power Supply 1014. The Power Supply 1014 can include a rechargeable or non-rechargeable battery that can be configured to be either user- or non-user-replaceable. In some embodiments, the Power Supply 1014 can include power-conversion or management circuits (including recharging, standby, or another power management functionality). In some embodiments, the Power Supply 1014 can include a power plug to allow the Computer 1002 to be plugged into a wall socket or another power source to, for example, power the Computer 1002 or recharge a rechargeable battery.

There can be any number of Computers 1002 associated with, or external to, a computer system containing Computer 1002, each Computer 1002 communicating over Network 1030. Further, the terms “client,” “user,” or other appropriate terminology can be used interchangeably, as appropriate, without departing from the scope of the present disclosure. Moreover, the present disclosure contemplates that many users can use one Computer 1002, or that one user can use multiple Computers 1002.

FIG. 11 is a diagram of an example of modules of an apparatus 1100 in accordance with embodiments of this specification. The apparatus 1100 can be an example embodiment of a data processing apparatus or an execution device for generating an action selection policy for completing a task in an environment that includes the execution device and one or more other devices. The apparatus 1100 can correspond to the embodiments described above, and the apparatus 1100 includes the following: a first identifying module 1101 for identifying a plurality of possible actions in a state, wherein the state corresponds to a vector of information sets, and each information set in the vector of information sets comprises a sequence of actions taken by the execution device that leads to the state; a second identifying module 1102 for identifying a vector of current action selection policies in the state, wherein each current action selection policy in the vector of current action selection policies corresponds to an information set in the vector of information sets, and the action selection policy specifies a respective probability of selecting an action among the plurality of possible actions in the state; a computing module 1103 for computing a sampling policy based on the vector of current action selection policies in the state, wherein the sampling policy specifies a respective sampling probability corresponding to each of the plurality of possible actions in the state; a sampling module 1104 for sampling an action among the plurality of possible actions in the state according to a sampling probability of the action specified in the sampling policy; and an updating module 1105 for updating each current action selection policy in the vector of current action selection policies of the execution device in the state based on the action.

In some embodiments, the apparatus 1100 further includes the following: a controlling module 1106 for controlling operations of the execution device based on the action selection policy in response to determining that a convergence condition is met.

In some embodiments, updating each current action selection policy in the vector of current action selection policies of the execution device in the state based on the action comprises performing Monte Carlo counterfactual regret minimization (MCCFR) based on the action.

In some embodiments, updating each current action selection policy in the vector of current action selection policies of the execution device in the state based on the action comprises: calculating a probability of a sampled terminal sequence of actions based on the sampling probability of the action, the sampled terminal sequence of actions including the action and a terminal state for completing a task; calculating a sampled counterfactual value of the action based on the probability of the sampled terminal sequence of actions; calculating a regret value of the action based on the sampled counterfactual value of the action; and updating each of the vector of current action selection policies of the execution device in the state based on the regret value of the action.

In some embodiments, the state corresponds to a public sequence that comprises one or more actions publicly known by the execution device and the one or more other devices; and each information set in the vector of information sets comprises the public sequence.

In some embodiments, computing a sampling policy based on the vector of current action selection policies of the execution device in the state comprises: computing the sampling probability corresponding to each of the plurality of possible actions in the state as a mean value of the current action selection policies of each of the plurality of possible actions in the state over the vector of information sets.

In some embodiments, computing a sampling policy based on the vector of current action selection policies of the execution device in the state comprises: computing the sampling probability corresponding to each of the plurality of possible actions in the state based on current action selection policies of each of the plurality of possible actions in the state and respective reach probabilities of the vector of information sets.

In some embodiments, computing the sampling probability corresponding to each of the plurality of possible actions in the state based on current action selection policies of each of the plurality of possible actions in the state and respective reach probabilities of the vector of information sets comprises: computing the sampling probability corresponding to each of the plurality of possible actions in the state based on a sum of the current action selection policies of each of the plurality of possible actions in the state weighted by the respective reach probabilities of the vector of information sets.

In some embodiments, computing a sampling policy based on the vector of current action selection policies of the execution device in the state comprises: computing the sampling probability corresponding to each of the plurality of possible actions in the state based on average action selection policies of each of the plurality of possible actions in the state and respective reach probabilities of the vector of information sets.

In some embodiments, computing the sampling probability corresponding to each of the plurality of possible actions in the state based on average action selection policies of each of the plurality of possible actions in the state and respective reach probabilities of the vector of information sets comprises: computing the sampling probability corresponding to each of the plurality of possible actions in the state based on a sum of the average action selection policies of each of the plurality of possible actions in the state weighted by the respective reach probabilities of the vector of information sets.
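
The mean-value and reach-weighted variants just described (cf. the computing module 1103) can be sketched together as follows; the list-of-dictionaries representation and the final normalization step are our assumptions.

```python
def sampling_policy_from_vector(policies, reach_probs=None):
    # `policies` holds one current (or average) action selection policy per
    # information set in the vector; with no reach weights this reduces to
    # the plain mean over the vector, otherwise each policy is weighted by
    # its information set's reach probability, then normalized.
    if reach_probs is None:
        reach_probs = [1.0] * len(policies)
    raw = {a: sum(w * p[a] for w, p in zip(reach_probs, policies))
           for a in policies[0]}
    z = sum(raw.values())
    return {a: v / z for a, v in raw.items()}
```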

FIG. 12 is a diagram of an example of modules of an apparatus 1200 in accordance with embodiments of this specification. The apparatus 1200 can be an example embodiment of a data processing apparatus or an execution device for generating an action selection policy for completing a task in an environment that includes the execution device and one or more other devices. The apparatus 1200 can correspond to the embodiments described above, and the apparatus 1200 includes the following: a computing module 1201 for computing a hybrid sampling policy at a state of the execution device based on a sampling policy and an exploration policy, wherein the state corresponds to a plurality of possible actions that lead to respective next states if performed by the execution device when the execution device is in the state, wherein the sampling policy specifies a respective sampling probability of sampling each of the plurality of possible actions in the state, wherein the exploration policy specifies a respective exploration probability corresponding to each of the plurality of possible actions in the state, wherein the exploration probability is negatively correlated with a number of times that each of the plurality of possible actions in the state has been sampled; a sampling module 1202 for sampling an action among the plurality of possible actions in the state according to a sampling probability of the action specified in the hybrid sampling policy; and an updating module 1203 for updating an action selection policy of the execution device in the state by performing Monte Carlo counterfactual regret minimization (MCCFR) based on the action, wherein the action selection policy specifies a respective probability of selecting an action among the plurality of possible actions in the state for completing the task in the environment.

In some embodiments, the apparatus 1200 further includes the following: a controlling module 1204 for controlling operations of the execution device based on the action selection policy in response to determining that a convergence condition is met.

In some embodiments, the apparatus 1200 further includes the following: an increasing module for increasing a number of times that the action has been sampled in the state; and a decreasing module for decreasing the exploration probability corresponding to the action out of the plurality of possible actions in the state for computing the hybrid sampling policy in a next iteration, in response to sampling the action out of the plurality of possible actions in the state according to the hybrid sampling policy.

In some embodiments, performing Monte Carlo counterfactual regret minimization (MCCFR) based on the action comprises: calculating a probability of a sampled terminal sequence of actions based on a hybrid sampling probability of the action, the sampled terminal sequence of actions including the action and a terminal state for completing a task; calculating a sampled counterfactual value of the action based on the probability of the sampled terminal sequence of actions; calculating a regret value of the action based on the sampled counterfactual value of the action; and updating the action selection policy of the execution device in the state based on the regret value of the action.

In some embodiments, the sampling policy comprises a first probability distribution over the plurality of possible actions in the state, and the exploration policy comprises a second probability distribution over the plurality of possible actions in the state.

In some embodiments, computing a hybrid sampling policy based on a sampling policy and an exploration policy comprises: computing a probability of each of the plurality of possible actions in the state based on a weighted sum of the sampling probability of each of the plurality of possible actions in the state and the exploration probability of each of the plurality of possible actions in the state.

In some embodiments, computing a hybrid sampling policy based on a sampling policy and an exploration policy comprises: computing a probability of each of the plurality of possible actions in the state according to:

σ_(i) ^(se)(a|I_(i))=(1−α)*σ_(i) ^(s)(a|I_(i))+α*σ_(i) ^(e)(a|I_(i)),

wherein: I_(i) represents an information set of the state; a represents one of the plurality of possible actions; σ_(i) ^(se)(a|I_(i)) represents a hybrid sampling policy of sampling the action a in the state; σ_(i) ^(s)(a|I_(i)) represents a sampling policy of sampling the action a in the state; σ_(i) ^(e)(a|I_(i)) represents an exploration policy of exploring the action a in the state; and α ∈ [0,1] represents a factor that controls a weight of exploration.

In some embodiments, the exploration probability of each of the plurality of possible actions in the state is computed according to:

$\sigma_{i}^{e,t}\left( a \mid I_{i} \right) = \frac{1 + \frac{\beta}{\sqrt{C^{t}\left( a \mid I_{i} \right)}}}{\sum_{a \in A\left( I_{i} \right)} \left( 1 + \frac{\beta}{\sqrt{C^{t}\left( a \mid I_{i} \right)}} \right)},$

wherein: i represents an identifier of the execution device; I_(i) represents an information set of the state; A(I_(i)) represents the plurality of possible actions in the state; a represents one of the plurality of possible actions in the state; t represents a current iteration; C^(t)(a|I_(i)) represents a number of times that the action a has been sampled in the state up to the current iteration t; σ_(i) ^(e,t)(a|I_(i)) represents an exploration policy of exploring the action a at the state in the current iteration t; and β is a nonnegative real number.

FIG. 13 is a diagram of an example of modules of an apparatus 1300 in accordance with embodiments of this specification. The apparatus 1300 can be an example embodiment of a data processing apparatus or an execution device for generating an action selection policy for completing a task in an environment that includes the execution device and one or more other devices. The apparatus 1300 can correspond to the embodiments described above, and the apparatus 1300 includes the following: in a current iteration of a plurality of iterations, a first computing module 1301 for computing a counterfactual value (CFV) of the execution device in a terminal state of completing a task based on a payoff of the execution device at the terminal state and a reach probability of the one or more other devices reaching the terminal state, wherein the terminal state results from a sequence of actions taken at a plurality of non-terminal states by the execution device and by the one or more other devices, wherein each of the plurality of non-terminal states has one or more child states; a second computing module 1302 for computing a baseline-corrected CFV of the execution device in the terminal state based on the CFV of the execution device in the terminal state, a CFV baseline of the execution device in the terminal state of a previous iteration, or both; for each of the non-terminal states and starting from a non-terminal state that has the terminal state and one or more other terminal states as child states: a third computing module 1303 for computing a CFV of the execution device in the non-terminal state based on a weighted sum of the baseline-corrected CFVs of the execution device in the child states of the non-terminal state; a fourth computing module 1304 for computing a baseline-corrected CFV of the execution device in the non-terminal state based on the CFV of the execution device in the non-terminal state, a CFV baseline of the execution device in the non-terminal state of a previous iteration, or both; a fifth computing module 1305 for computing a CFV baseline of the execution device in the non-terminal state of the current iteration based on a weighted sum of the CFV baseline of the execution device in the non-terminal state of the previous iteration and the CFV or the baseline-corrected CFV of the execution device in the non-terminal state; and a determining module 1306 for determining an action selection policy in the non-terminal state for the next iteration based on the baseline-corrected CFV of the execution device in the non-terminal state of the current iteration.

In some embodiments, the apparatus 1300 further includes the following: a controlling module 1307 for controlling operations of the execution device in the non-terminal state based on the action selection policy in the non-terminal state for the next iteration in response to determining that a convergence condition is met.

In some embodiments, determining an action selection policy in the non-terminal state for the next iteration based on the baseline-corrected CFV of the execution device in the non-terminal state of the current iteration comprises: calculating a regret value based on the baseline-corrected CFV of the execution device in the non-terminal state of the current iteration; and determining an action selection policy in the non-terminal state for the next iteration based on the regret value according to regret matching.

In some embodiments, the reach probability of the one or more other devices reaching the terminal state comprises a product of probabilities of actions taken by the one or more other devices that reach the terminal state.

In some embodiments, computing a baseline-corrected CFV of the execution device in the non-terminal state based on the CFV of the execution device in the non-terminal state, a CFV baseline of the execution device in the non-terminal state of a previous iteration, or both comprises: computing a sampled CFV baseline of the execution device that takes the action in the terminal state of the previous iteration based on the CFV baseline of the execution device in the terminal state of the previous iteration, a sampling policy of the execution device that takes the action in the terminal state of the previous iteration, and a probability of reaching the terminal state that results from a sequence of actions taken by the execution device; in response to determining that the action is sampled, computing a baseline-corrected CFV of the execution device that takes the action in the non-terminal state based on the CFV of the execution device in the non-terminal state and the sampled CFV baseline of the execution device that takes the action in the terminal state of the previous iteration; and in response to determining that the action is not sampled, using the sampled CFV baseline of the execution device that takes the action in the terminal state of the previous iteration as the baseline-corrected CFV of the execution device in the non-terminal state.

In some embodiments, the weighted sum of the baseline-corrected CFV of the execution device in the terminal state and corresponding baseline-corrected CFVs of the execution device in the one or more other terminal states is computed based on the baseline-corrected CFV of the execution device in the terminal state and corresponding baseline-corrected CFVs of the execution device in the one or more other terminal states weighted by an action selection policy in the non-terminal state in the current iteration.

In some embodiments, the weighted sum of the CFV baseline of the execution device in the non-terminal state of the previous iteration and the CFV or the baseline-corrected CFV of the execution device in the non-terminal state comprises a sum of: the CFV baseline of the execution device in the non-terminal state of the previous iteration weighted by a scalar; and the CFV or the baseline-corrected CFV of the execution device in the non-terminal state weighted by a second scalar and a probability of considering the non-terminal state.

The system, apparatus, module, or unit illustrated in the previous embodiments can be implemented by using a computer chip or an entity, or can be implemented by using a product having a certain function. A typical embodiment device is a computer, and the computer can be a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email receiving and sending device, a game console, a tablet computer, a wearable device, or any combination of these devices.

For an embodiment process of functions and roles of each module in the apparatus, references can be made to an embodiment process of corresponding steps in the previous method. Details are omitted here for simplicity.

Because an apparatus embodiment basically corresponds to a method embodiment, for related parts, references can be made to related descriptions in the method embodiment. The previously described apparatus embodiment is merely an example. The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one position, or may be distributed on a number of network modules. Some or all of the modules can be selected based on actual demands to achieve the objectives of the solutions of the specification. A person of ordinary skill in the art can understand and implement the embodiments of the present application without creative efforts.

Referring again to FIGS. 11-13, each of the figures can be interpreted as illustrating an internal functional module and a structure of a data processing apparatus or an execution device for generating an action selection policy for completing a task in an environment that includes the execution device and one or more other devices. An execution body in essence can be an electronic device, and the electronic device includes the following: one or more processors; and one or more computer-readable memories configured to store an executable instruction of the one or more processors. In some embodiments, the one or more computer-readable memories are coupled to the one or more processors and have programming instructions stored thereon that are executable by the one or more processors to perform algorithms, methods, functions, processes, flows, and procedures, as described in this specification. This specification also provides one or more non-transitory computer-readable storage media coupled to one or more processors and having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations in accordance with embodiments of the methods provided herein.

This specification further provides a system for implementing the methods provided herein. The system includes one or more processors, and a computer-readable storage medium coupled to the one or more processors having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations in accordance with embodiments of the methods provided herein.

Embodiments of the subject matter and the actions and operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, e.g., one or more modules of computer program instructions, encoded on a computer program carrier, for execution by, or to control the operation of, data processing apparatus. For example, a computer program carrier can include one or more computer-readable storage media that have instructions encoded or stored thereon. The carrier may be a tangible non-transitory computer-readable medium, such as a magnetic, magneto-optical, or optical disk, a solid state drive, a random access memory (RAM), a read-only memory (ROM), or other types of media. Alternatively, or in addition, the carrier may be an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be or be part of a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. A computer storage medium is not a propagated signal.

A computer program, which may also be referred to or described as a program, software, a software application, an app, a module, a software module, an engine, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand-alone program or as a module, component, engine, subroutine, or other unit suitable for executing in a computing environment, which environment may include one or more computers interconnected by a data communication network in one or more locations.

A computer program may, but need not, correspond to a file in a file system. A computer program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code.

Processors for execution of a computer program include, by way of example, both general- and special-purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive the instructions of the computer program for execution as well as data from a non-transitory computer-readable medium coupled to the processor.

The term “data processing apparatus” encompasses all kinds of apparatuses, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. Data processing apparatus can include special-purpose logic circuitry, e.g., an FPGA (field programmable gate array), an ASIC (application-specific integrated circuit), or a GPU (graphics processing unit). The apparatus can also include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

The processes and logic flows described in this specification can be performed by one or more computers or processors executing one or more computer programs to perform operations by operating on input data and generating output. The processes and logic flows can also be performed by special-purpose logic circuitry, e.g., an FPGA, an ASIC, or a GPU, or by a combination of special-purpose logic circuitry and one or more programmed computers.

Computers suitable for the execution of a computer program can be based on general or special-purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. Elements of a computer can include a central processing unit for executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special-purpose logic circuitry.

Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, one or more storage devices. The storage devices can be, for example, magnetic, magneto-optical, or optical disks, solid state drives, or any other type of non-transitory, computer-readable media. However, a computer need not have such devices. Thus, a computer may be coupled to one or more storage devices, such as one or more memories, that are local and/or remote. For example, a computer can include one or more local memories that are integral components of the computer, or the computer can be coupled to one or more remote memories that are in a cloud network. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.

Components can be “coupled to” each other by being communicatively, such as electrically or optically, connected to one another, either directly or via one or more intermediate components. Components can also be “coupled to” each other if one of the components is integrated into the other. For example, a storage component that is integrated into a processor (e.g., an L2 cache component) is “coupled to” the processor.

To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on, or configured to communicate with, a computer having a display device, e.g., an LCD (liquid crystal display) monitor, for displaying information to the user, and an input device by which the user can provide input to the computer, e.g., a keyboard and a pointing device, e.g., a mouse, a trackball, or a touchpad. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's device in response to requests received from the web browser, or by interacting with an app running on a user device, e.g., a smartphone or electronic tablet. Also, a computer can interact with a user by sending text messages or other forms of messages to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.

This specification uses the term “configured to” in connection with systems, apparatus, and computer program components. For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions. For special-purpose logic circuitry to be configured to perform particular operations or actions means that the circuitry has electronic logic that performs the operations or actions.

While this specification contains many specific embodiment details, these should not be construed as limitations on the scope of what is being claimed, which is defined by the claims themselves, but rather as descriptions of features that may be specific to particular embodiments. Certain features that are described in this specification in the context of separate embodiments can also be realized in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be realized in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claim may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

1. A computer-implemented method of an execution device for generating an action selection policy for completing a task in an environment that includes the execution device and one or more other devices, the method comprising: in a current iteration of a plurality of iterations, computing, by the execution device, a counterfactual value (CFV) of the execution device in a terminal state of completing a task based on a payoff of the execution device at the terminal state and a reach probability of the one or more other devices reaching the terminal state, wherein the terminal state results from a sequence of actions taken at a plurality of non-terminal states by the execution device and by the one or more other devices, wherein each of the plurality of non-terminal states has one or more child states; computing, by the execution device, a baseline-corrected CFV of the execution device in the terminal state based on the CFV of the execution device in the terminal state, a CFV baseline of the execution device in the terminal state of a previous iteration, or the CFV of the execution device in the terminal state and the CFV baseline of the execution device in the terminal state of the previous iteration; for each of the non-terminal states and starting from a non-terminal state that has the terminal state and one or more other terminal states as child states: computing, by the execution device, a CFV of the execution device in the non-terminal state based on a weighted sum of baseline-corrected CFVs of the execution device in the child states of the non-terminal state; computing, by the execution device, a baseline-corrected CFV of the execution device in the non-terminal state based on the CFV of the execution device in the non-terminal state, a CFV baseline of the execution device in the non-terminal state of a previous iteration, or the CFV of the execution device in the non-terminal state and the CFV baseline of the execution device in the non-terminal state of the previous iteration; computing, by the execution device, a CFV baseline of the execution device in the non-terminal state of the current iteration based on a weighted sum of the CFV baseline of the execution device in the non-terminal state of the previous iteration and the CFV of the execution device in the non-terminal state or the baseline-corrected CFV of the execution device in the non-terminal state; and determining, by the execution device, an action selection policy in the non-terminal state for the next iteration based on the baseline-corrected CFV of the execution device in the non-terminal state of the current iteration.
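For illustration only, the following Python sketch walks the recursion of claim 1 over a toy game tree: terminal CFVs are computed from payoffs and the other devices' reach probability, corrected against a stored baseline, and rolled up through policy-weighted sums, with the baseline refreshed on the way. The Node class, the DECAY scalar, and the constant reach probability are assumptions of this sketch, not features recited by the claim; the policy-update step is sketched separately after claim 3.

DECAY = 0.5  # illustrative scalar weighting the previous iteration's baseline

class Node:
    def __init__(self, payoff=None, children=None, policy=None):
        self.payoff = payoff            # terminal payoff; None for non-terminal states
        self.children = children or {}  # action -> child Node
        self.policy = policy or {}      # action -> probability in the current iteration
        self.baseline = 0.0             # CFV baseline carried over from the previous iteration

def backward_pass(node, opp_reach):
    # Returns the baseline-corrected CFV of `node` for the current iteration.
    if node.payoff is not None:
        # Terminal state: CFV = payoff x reach probability of the other devices.
        cfv = node.payoff * opp_reach
    else:
        # Non-terminal state: weighted sum of the children's baseline-corrected
        # CFVs, weighted by the current action selection policy.
        cfv = sum(node.policy[a] * backward_pass(child, opp_reach)
                  for a, child in node.children.items())
    # Baseline correction; in this unsampled sketch the sample probability q is 1,
    # so the correction is exact, while under sampling it reduces variance.
    q = 1.0
    corrected = node.baseline + (cfv - node.baseline) / q
    # CFV baseline of the current iteration: weighted sum of the previous
    # baseline and the current CFV (see the sketch after claim 7).
    node.baseline = DECAY * node.baseline + (1.0 - DECAY) * cfv
    return corrected

# Toy tree: one decision between two actions that end the task immediately.
root = Node(children={"a": Node(payoff=1.0), "b": Node(payoff=-1.0)},
            policy={"a": 0.5, "b": 0.5})
print(backward_pass(root, opp_reach=0.25))  # 0.0 on the first iteration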
 2. The computer-implemented method of claim 1, further comprising, in response to determining that a convergence condition is met, controlling operations of the execution device in the non-terminal state based on the action selection policy in the non-terminal state for the next iteration.
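A minimal sketch of the convergence test contemplated by claim 2, assuming convergence is declared when the policy stops changing between iterations; the tolerance and the dictionary representation are illustrative, not claimed:

def converged(prev_policy, policy, tol=1e-4):
    # True once no action's probability moved by more than `tol`.
    return all(abs(policy[a] - prev_policy[a]) <= tol for a in policy)

# Once converged(...) returns True, the execution device can be controlled
# with the latest action selection policy rather than iterating further.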
 3. The computer-implemented method of claim 1, wherein determining the action selection policy in the non-terminal state for the next iteration based on the baseline-corrected CFV of the execution device in the non-terminal state of the current iteration comprises: calculating a regret value based on the baseline-corrected CFV of the execution device in the non-terminal state of the current iteration; and determining the action selection policy in the non-terminal state for the next iteration based on the regret value according to regret matching.
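Claim 3's regret-matching step admits a compact sketch: cumulative regrets are clipped at zero and normalized to give the next iteration's policy, with a uniform fallback when no regret is positive. The action names below are hypothetical.

def regret_matching(cum_regrets):
    # Next iteration's action selection policy from cumulative regrets.
    positive = {a: max(r, 0.0) for a, r in cum_regrets.items()}
    total = sum(positive.values())
    if total > 0.0:
        return {a: p / total for a, p in positive.items()}
    return {a: 1.0 / len(cum_regrets) for a in cum_regrets}

# The regrets would come from the baseline-corrected CFVs of claim 1, e.g.
# regret(a) = corrected CFV of taking `a` minus the state's corrected CFV.
print(regret_matching({"call": 2.0, "fold": -1.0}))  # {'call': 1.0, 'fold': 0.0}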
4. The computer-implemented method of claim 1, wherein the reach probability of the one or more other devices reaching the terminal state comprises a product of probabilities of actions taken by the one or more other devices to reach the terminal state.
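The reach probability of claim 4 is a plain product; for example, with made-up per-action probabilities along one path to a terminal state:

from math import prod

# Hypothetical probabilities of the three actions the other devices took on
# the path from the initial state to one terminal state.
opponent_action_probs = [0.5, 0.8, 0.25]
opp_reach = prod(opponent_action_probs)  # 0.1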
 5. The computer-implemented method of claim 1, wherein computing a baseline-corrected CFV of the execution device in the non-terminal state based on the CFV of the execution device in the non-terminal state, a CFV baseline of the execution device in the non-terminal state of a previous iteration, or the CFV of the execution device in the non-terminal state and the CFV baseline of the execution device in the non-terminal state of the previous iteration comprises: computing a sampled CFV baseline of the execution device that takes an action in the terminal state of the previous iteration based on the CFV baseline of the execution device in the terminal state of the previous iteration, a sampling policy of the execution device that takes the action in the terminal state of the previous iteration, and a probability of reaching the terminal state that results from the sequence of actions taken by the execution device; in response to determining that the action is sampled, computing a baseline-corrected CFV of the execution device that takes the action in the non-terminal state based on the CFV of the execution device in the non-terminal state and the sampled CFV baseline of the execution device that takes the action in the terminal state of the previous iteration; and in response to determining that the action is not sampled, using the sampled CFV baseline of the execution device that takes the action in the terminal state of the previous iteration as the baseline-corrected CFV of the execution device in the non-terminal state.
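The sampled/unsampled split in claim 5 follows the usual control-variate construction for sampled CFR variants; the sketch below does not track the claim's wording term for term, and `q` stands in for the combination of the sampling policy's probability of the action and the probability of reaching the state under the execution device's own actions:

def baseline_corrected_cfv(cfv, baseline, q, sampled):
    # `baseline` is the CFV baseline stored in the previous iteration.
    if sampled:
        # Sampled action: previous baseline plus the importance-weighted residual.
        return baseline + (cfv - baseline) / q
    # Unsampled action: the previous iteration's baseline stands in for the value.
    return baseline

Dividing the residual by q keeps the estimate unbiased in expectation, which is the point of the correction: unsampled branches contribute their baseline, and sampled branches contribute the baseline plus an upweighted surprise.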
 6. The computer-implemented method of claim 1, wherein the weighted sum of the baseline-corrected CFV of the execution device in the terminal state and corresponding baseline-corrected CFVs of the execution device in the one or more other terminal states is computed based on the baseline-corrected CFV of the execution device in the terminal state and corresponding baseline-corrected CFVs of the execution device in the one or more other terminal states weighted by an action selection policy in the non-terminal state in the current iteration.
 7. The computer-implemented method of claim 1, wherein the weighted sum of the CFV baseline of the execution device in the non-terminal state of the previous iteration and the CFV or the baseline-corrected CFV of the execution device in the non-terminal state comprises a sum of: the CFV baseline of the execution device in the non-terminal state of the previous iteration weighted by a scalar; and the CFV or the baseline-corrected CFV of the execution device in the non-terminal state weighted by a second scalar and a probability of considering the non-terminal state.
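The two-scalar weighted sum in claim 7 can be read as an exponentially decayed average of the CFV estimates; a sketch under that assumption, with all parameter names illustrative:

def update_baseline(prev_baseline, value, s1, s2, consider_prob):
    # New CFV baseline per claim 7: the previous iteration's baseline weighted
    # by one scalar, plus the CFV (or baseline-corrected CFV) weighted by a
    # second scalar and the probability of considering the non-terminal state.
    return s1 * prev_baseline + s2 * consider_prob * value

# For example, s1 = 0.5 and s2 = 0.5 give a decayed running average of the
# CFV estimates, discounted by how often the state is considered.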
 8. A non-transitory, computer-readable medium storing one or more instructions executable by a computer system to perform operations comprising: in a current iteration of a plurality of iterations, computing a counterfactual value (CFV) of an execution device in a terminal state of completing a task based on a payoff of the execution device at the terminal state and a reach probability of one or more other devices reaching the terminal state, wherein the terminal state results from a sequence of actions taken at a plurality of non-terminal states by the execution device and by the one or more other devices, wherein each of the plurality of non-terminal states has one or more child states; computing a baseline-corrected CFV of the execution device in the terminal state based on the CFV of the execution device in the terminal state, a CFV baseline of the execution device in the terminal state of a previous iteration, or the CFV of the execution device in the terminal state and the CFV baseline of the execution device in the terminal state of the previous iteration; for each of the non-terminal states and starting from a non-terminal state that has the terminal state and one or more other terminal states as child states: computing a CFV of the execution device in the non-terminal state based on a weighted sum of baseline-corrected CFVs of the execution device in the child states of the non-terminal state; computing a baseline-corrected CFV of the execution device in the non-terminal state based on the CFV of the execution device in the non-terminal state, a CFV baseline of the execution device in the non-terminal state of a previous iteration, or the CFV of the execution device in the non-terminal state and the CFV baseline of the execution device in the non-terminal state of the previous iteration; computing a CFV baseline of the execution device in the non-terminal state of the current iteration based on a weighted sum of the CFV baseline of the execution device in the non-terminal state of the previous iteration and the CFV of the execution device in the non-terminal state or the baseline-corrected CFV of the execution device in the non-terminal state; and determining an action selection policy in the non-terminal state for the next iteration based on the baseline-corrected CFV of the execution device in the non-terminal state of the current iteration.
9. The non-transitory, computer-readable medium of claim 8, the operations further comprising, in response to determining that a convergence condition is met, controlling operations of the execution device in the non-terminal state based on the action selection policy in the non-terminal state for the next iteration.
 10. The non-transitory, computer-readable medium of claim 8, wherein determining the action selection policy in the non-terminal state for the next iteration based on the baseline-corrected CFV of the execution device in the non-terminal state of the current iteration comprises: calculating a regret value based on the baseline-corrected CFV of the execution device in the non-terminal state of the current iteration; and determining the action selection policy in the non-terminal state for the next iteration based on the regret value according to regret matching.
11. The non-transitory, computer-readable medium of claim 8, wherein the reach probability of the one or more other devices reaching the terminal state comprises a product of probabilities of actions taken by the one or more other devices to reach the terminal state.
 12. The non-transitory, computer-readable medium of claim 8, wherein computing a baseline-corrected CFV of the execution device in the non-terminal state based on the CFV of the execution device in the non-terminal state, a CFV baseline of the execution device in the non-terminal state of a previous iteration, or the CFV of the execution device in the non-terminal state and the CFV baseline of the execution device in the non-terminal state of the previous iteration comprises: computing a sampled CFV baseline of the execution device that takes an action in the terminal state of the previous iteration based on the CFV baseline of the execution device in the terminal state of the previous iteration, a sampling policy of the execution device that takes the action in the terminal state of the previous iteration, and a probability of reaching the terminal state that results from the sequence of actions taken by the execution device; in response to determining that the action is sampled, computing a baseline-corrected CFV of the execution device that takes the action in the non-terminal state based on the CFV of the execution device in the non-terminal state and the sampled CFV baseline of the execution device that takes the action in the terminal state of the previous iteration; and in response to determining that the action is not sampled, using the sampled CFV baseline of the execution device that takes the action in the terminal state of the previous iteration as the baseline-corrected CFV of the execution device in the non-terminal state.
 13. The non-transitory, computer-readable medium of claim 8, wherein the weighted sum of the baseline-corrected CFV of the execution device in the terminal state and corresponding baseline-corrected CFVs of the execution device in the one or more other terminal states is computed based on the baseline-corrected CFV of the execution device in the terminal state and corresponding baseline-corrected CFVs of the execution device in the one or more other terminal states weighted by an action selection policy in the non-terminal state in the current iteration.
 14. The non-transitory, computer-readable medium of claim 8, wherein the weighted sum of the CFV baseline of the execution device in the non-terminal state of the previous iteration and the CFV or the baseline-corrected CFV of the execution device in the non-terminal state comprises a sum of: the CFV baseline of the execution device in the non-terminal state of the previous iteration weighted by a scalar; and the CFV or the baseline-corrected CFV of the execution device in the non-terminal state weighted by a second scalar and a probability of considering the non-terminal state.
 15. A computer-implemented system, comprising: one or more computers; and one or more computer memory devices interoperably coupled with the one or more computers and having tangible, non-transitory, machine-readable media storing one or more instructions that, when executed by the one or more computers, perform one or more operations comprising: in a current iteration of a plurality of iterations, computing a counterfactual value (CFV) of an execution device in a terminal state of completing a task based on a payoff of the execution device at the terminal state and a reach probability of one or more other devices reaching the terminal state, wherein the terminal state results from a sequence of actions taken at a plurality of non-terminal states by the execution device and by the one or more other devices, wherein each of the plurality of non-terminal states has one or more child states; computing a baseline-corrected CFV of the execution device in the terminal state based on the CFV of the execution device in the terminal state, a CFV baseline of the execution device in the terminal state of a previous iteration, or the CFV of the execution device in the terminal state and the CFV baseline of the execution device in the terminal state of the previous iteration; for each of the non-terminal states and starting from a non-terminal state that has the terminal state and one or more other terminal states as child states: computing a CFV of the execution device in the non-terminal state based on a weighted sum of baseline-corrected CFVs of the execution device in the child states of the non-terminal state; computing a baseline-corrected CFV of the execution device in the non-terminal state based on the CFV of the execution device in the non-terminal state, a CFV baseline of the execution device in the non-terminal state of a previous iteration, or the CFV of the execution device in the non-terminal state and the CFV baseline of the execution device in the non-terminal state of the previous iteration; computing a CFV baseline of the execution device in the non-terminal state of the current iteration based on a weighted sum of the CFV baseline of the execution device in the non-terminal state of the previous iteration and the CFV of the execution device in the non-terminal state or the baseline-corrected CFV of the execution device in the non-terminal state; and determining an action selection policy in the non-terminal state for the next iteration based on the baseline-corrected CFV of the execution device in the non-terminal state of the current iteration.
16. The computer-implemented system of claim 15, the operations further comprising, in response to determining that a convergence condition is met, controlling operations of the execution device in the non-terminal state based on the action selection policy in the non-terminal state for the next iteration.
 17. The computer-implemented system of claim 15, wherein determining the action selection policy in the non-terminal state for the next iteration based on the baseline-corrected CFV of the execution device in the non-terminal state of the current iteration comprises: calculating a regret value based on the baseline-corrected CFV of the execution device in the non-terminal state of the current iteration; and determining the action selection policy in the non-terminal state for the next iteration based on the regret value according to regret matching.
18. The computer-implemented system of claim 15, wherein the reach probability of the one or more other devices reaching the terminal state comprises a product of probabilities of actions taken by the one or more other devices to reach the terminal state.
 19. The computer-implemented system of claim 15, wherein computing a baseline-corrected CFV of the execution device in the non-terminal state based on the CFV of the execution device in the non-terminal state, a CFV baseline of the execution device in the non-terminal state of a previous iteration, or the CFV of the execution device in the non-terminal state and the CFV baseline of the execution device in the non-terminal state of the previous iteration comprises: computing a sampled CFV baseline of the execution device that takes an action in the terminal state of the previous iteration based on the CFV baseline of the execution device in the terminal state of the previous iteration, a sampling policy of the execution device that takes the action in the terminal state of the previous iteration, and a probability of reaching the terminal state that results from the sequence of actions taken by the execution device; in response to determining that the action is sampled, computing a baseline-corrected CFV of the execution device that takes the action in the non-terminal state based on the CFV of the execution device in the non-terminal state and the sampled CFV baseline of the execution device that takes the action in the terminal state of the previous iteration; and in response to determining that the action is not sampled, using the sampled CFV baseline of the execution device that takes the action in the terminal state of the previous iteration as the baseline-corrected CFV of the execution device in the non-terminal state.
 20. The computer-implemented system of claim 15, wherein the weighted sum of the baseline-corrected CFV of the execution device in the terminal state and corresponding baseline-corrected CFVs of the execution device in the one or more other terminal states is computed based on the baseline-corrected CFV of the execution device in the terminal state and corresponding baseline-corrected CFVs of the execution device in the one or more other terminal states weighted by an action selection policy in the non-terminal state in the current iteration. 